Identifying Smart Living Partnership Opportunities Through Segmentation for RBC Home and Property Insurance
M733 Term Project
NB: This document filters Ontario FSALDU rows from the raw DemoStats2020 data
pacman::p_load(psych, xray, DT, excelR,
               sjmisc, sjlabelled, sjstats, sjPlot, dplyr,
               knitr, kableExtra, captioner)
2 Executive Summary
Using K-means clustering and a complementary GLM logistic regression model, four key segments were identified to target for home and property insurance partnership purposes:
Segment 1: Open-Minded Renters | These individuals (26-64) are the most tech-enthusiastic of all renters, yet their smart tech adoption is medium: they own smartphones, smart thermostats, and smart home assistants. Although they are excited about technology, they are not likely to purchase smart home technologies in the future, perhaps because there is not enough incentive for them to do so (e.g. lowering insurance premiums, simplifying their lives). There is also a large opportunity to educate these individuals on the ease of use of these technologies, as they are middle-aged and lead busy lives, most with one to two children in an apartment. While they don’t plan on owning smart home technology, they are considered open-minded: they are among the segments most willing to share their mobile information, are enthusiastic about technology in general, trust banking/financial apps, and are therefore likely to convert. Overall, they are technologically enabled but need an extra push to purchase more smart home technology. Location: Scattered all across Ontario
Segment 2: Low-Tech Homeowners | This segment owns homes, falls in the 40-64 age bracket, and is not only slow to adopt smart home technology but does not plan on purchasing it in the future. They live in 3-4 person households, and given that they have medium-to-high home insurance premiums, there is an opportunity to educate them on the impact smart home technology could have on those premiums. They are not tech enthusiasts, but they currently own smartphones and smart home assistants. Interestingly, they are the most likely to consent to mobile information sharing. Location: Northern and Southern Ontario
Segment 3: Migrant Technophiles | These individuals are very excited by technology. They own the most smart technologies, including smartphones, smart thermostats, smart home security systems, smart home assistants, and smart home appliances and lighting. They are also likely to purchase more smart home technology in the future (specifically smart home assistants). Most have two children per household, are employed, and hold a university degree as their highest level of education (the highest among the four segments). Notably, they have the highest combined household income, roughly $148,000. They are also the most confident in big businesses as well as banking/financial applications, which is a plus for RBCI. Location: Southern Ontario
Segment 4: Relaxed Retirees | This segment is 65+ and pays high insurance premiums, but is slow to adopt smart home technology. Most have a smart thermostat and a smart home assistant, but don’t plan on purchasing more smart home technology in the future. These individuals (or their children) need considerable education to become more technologically enabled. Considering that the relaxed retirees have the highest insurance premiums, messaging should emphasize opportunities to decrease premiums through the adoption of certain smart home technologies. They are also confident in big businesses, which is a plus for RBCI. Location: In or surrounding cities (non-rural)
These segments will be reached using specific messaging and marketing channels, which are detailed further in the document.
Partnership opportunities that are detailed towards the end of the document include: Google Nest, Ring, Amazon Echo, and General Electric.
3 Background
People don’t wake up thinking “I need to go buy insurance”. As such, RBC Insurance (RBCI) has tasked us with investigating where insurance can be logically placed to contribute to the client value proposition. Smart devices and home automation are growing industries and very important to the insurance business as a mechanism to help engage and educate consumers, prevent loss, and provide data and experience to help price risk.
With new entrants delivering benefits to our clients in many ways, some smart and connected devices provide a logical connection with the insurance value proposition. To assess a potential partnership ecosystem around smart home automation, RBCI wants to understand which market segments are interested in smart technology, the type of technology they are interested in and where they are located across Canada.
With that in mind, the project scope (the business opportunity) has been narrowed down to the following: by leveraging smart home technology, who in Ontario should RBC Insurance target for its home & property insurance solutions?
Environics Analytics has kindly shared multiple datasets with our team, comprising 6,375 variables. Our team has narrowed the scope of the project to home and property insurance, and the resulting variable list has therefore been shortened to the following 32 variables:
Technology Variables
- Technology Enthusiast - Are you enthusiastic about technology?
- Mobile Device Marketing Consent - Do you consent to mobile marketing?
- Mobile Information Sharing Consent - Do you often consent to sharing your mobile information?
- Smartphone Purchase People - Do you make purchases on your smart phone?
- Smart Thermostat Owners - Do you currently own a smart thermostat?
- Smart Home Security Owners - Do you currently have smart home security?
- Smart Home Assistant Owners - Do you currently own a smart home assistant?
- Smart Home Devices Owners - Do you currently own smart home devices?
- Smart Thermostat Owners (Planning) - Do you plan to purchase a smart thermostat?
- Smart Home Security Owners (Planning) - Do you plan to purchase smart home security systems/smartlocks?
- Smart Home Assistant Owners (Planning) - Do you plan to purchase smart home assistants (e.g. Google Home, Amazon Echo, etc)?
- Smart Home Devices Owners (Planning) - Do you plan to purchase a smart home devices, appliances, lighting?
- Financial App Trust - Do you trust Banking/Financial Apps?
Socio-Demographic Variables
- Total Population Median Age
- Male Population Median Age
- Female Population Median Age
- Average Number Of Persons In Private Households
- Total Family Households
- Non-Family Households
- Average Children Per Census Family Household
- Houses
- Apartment, Building Low And High Rise
- Condos
- Median Household Income
- High School Certificate Or Equivalent
- College, CEGEP Or Other Non-University Certificate Or Diploma
- University Certificate Or Diploma Below Bachelor
- Employed
- Unemployed
- Total Immigrant
Insurance Variables
- Annual Premiums - Home Insurance - What are your annual premiums for home insurance?
Business Variables
- Big Business Confidence - Are you confident in Big Businesses?
4 Objective
The objective of this project is to use the Environics Analytics data to perform a segmentation analysis, build segment profiles, and recommend smart technology partnership opportunities for RBCI. The locations and marketing strategies for these segments will also be identified.
5 Methods
- The initial Environics Analytics datasets were analyzed to identify 33 relevant variables for the project. These variables were then combined into one csv file, called “Project_Cleaned_File.csv”. Some variables needed to be divided by “Total Population” or “Total Household”.
- The combined dataset of 33 pre-selected variables was read into R Markdown; the data was cleaned, and variables were reordered and renamed.
- Following this, a segmentation analysis was performed using K-means Clustering.
- Level 1 segmentation was performed using attitude variables (predominantly technology-related variables).
- ML models (GBM, GLM, and RF) were built using h2o and their quality was assessed.
- Level 2 segmentation was performed using demographic variables.
- ML models (GBM, GLM, Deep Learning, and RF) were built using h2o and their quality was assessed.
- Segment personas were created.
- Some visualizations of the selected model were then prepared.
- Given the results of the above analyses, a final recommendation was provided.
6 Analysis
H2O is not running yet, starting it now...
Note: In case of errors look at the following log files:
C:\Users\jaspr\AppData\Local\Temp\RtmpSabFlD\file2388650737da/h2o_jaspr_started_from_r.out
C:\Users\jaspr\AppData\Local\Temp\RtmpSabFlD\file2388129c246f/h2o_jaspr_started_from_r.err
Starting H2O JVM and connecting: Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 3 seconds 339 milliseconds
H2O cluster timezone: America/Toronto
H2O data parsing timezone: UTC
H2O cluster version: 3.32.0.1
H2O cluster version age: 2 months and 9 days
H2O cluster name: H2O_started_from_R_jaspr_cyu991
H2O cluster total nodes: 1
H2O cluster total memory: 3.95 GB
H2O cluster total cores: 12
H2O cluster allowed cores: 12
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4
R Version: R version 4.0.3 (2020-10-10)
6.1 Reading the data provided by Environics Analytics (EA).
We will be using the h2o package to read the data provided by EA.
6.2 Finding the column numbers of the variables identified as basis variables
Since there are many variables present in each data sheet, we need to select only the required ones. In the next steps, we select the required variables from each data file for further operations.
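As a minimal illustration of this lookup-and-subset step, the sketch below works on a hypothetical toy data frame in base R (the notebook itself does the equivalent on h2o frames; the column names follow the EA code book, but the values are made up):

```r
# Toy stand-in for a DemoStats sheet.
demo <- data.frame(CODE      = c("K0A1A0", "K0A1B0"),
                   ECYBASPOP = c(11875, 3339),
                   ECYPTAMED = c(46.1, 46.0),
                   ECYHNIMED = c(95077.29, 129215.03))

wanted <- c("CODE", "ECYBASPOP", "ECYHNIMED")  # basis variables to keep
idx <- match(wanted, names(demo))              # their column positions
demo_sub <- demo[, idx]                        # copy only the selected columns
```

Collecting the positions first (rather than hard-coding them) keeps the selection robust if the sheet layout changes.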
6.2.2 DemoStats
Just repeat the same sequence of operations on each of the remaining data files.
df2.hex<-h2o.importFile("C:/Jaspreet Marketing Analytics Project/EA 2020 Data/DemoStats2020_fsaldu_05nov20.csv")
[1] 1
[1] 5
[1] 42
[1] 63
[1] 84
[1] 165
[1] 167
[1] 170
[1] 255
[1] 274
[1] 278
[1] 286
[1] 323
[1] 331
[1] 333
[1] 335
[1] 349
[1] 350
[1] 540
df2s.hex<- as.h2o(df2.hex[, c(1,5,42,63,84,165,167,170,255,274,278,286,323,331,333,335,349,350,540 )]) # copy selected columns to new dataframe
names(df2s.hex)
[1] "CODE" "ECYBASPOP" "ECYPTAMED" "ECYPMAMED" "ECYPFAMED"
[6] "ECYHSZAVG" "ECYHTYFHT" "ECYHTYNFH" "ECYCHAFHCH" "ECYSTYHOUS"
[11] "ECYSTYAPT" "ECYCDOCO" "ECYHNIMED" "ECYEDUHSCE" "ECYEDUCOLL"
[16] "ECYEDUUD" "ECYACTEMP" "ECYACTUEMP" "ECYTIMIMGT"
6.2.3 Money Matters
df3.hex<-h2o.importFile("C:/Jaspreet Marketing Analytics Project/EA 2020 Data/MoneyMatters5_2020_fsaldu_06nov20.csv")
[1] 163
[1] 1
[1] "CFM0876I" "CODE"
6.2.4 Mobile 1
df4.hex<-h2o.importFile("C:/Jaspreet Marketing Analytics Project/EA 2020 Data/Mobile1_2020_fsaldu_06nov20.csv")
[1] 41
[1] 42
[1] 87
[1] 1
df4s.hex<- as.h2o(df4.hex[, c(1,41,42,87)]) # copy selected columns to new dataframe
names(df4s.hex)
[1] "CODE" "M221C45" "M222C45" "M3F3C3"
6.2.5 Mobile 2
df5.hex<-h2o.importFile("C:/Jaspreet Marketing Analytics Project/EA 2020 Data/Mobile2_2020_fsaldu_06nov20.csv")
[1] 20
[1] 21
[1] 22
[1] 23
[1] 47
[1] 48
[1] 49
[1] 50
[1] 1
df5s.hex<- as.h2o(df5.hex[, c(1,20,21,22,23,47,48,49,50)]) # copy selected columns to new dataframe
names(df5s.hex)
[1] "CODE" "M2X33C31" "M2X32C32" "M2X33C33" "M2X33C34" "M2Y33C31" "M2Y32C32"
[8] "M2Y33C33" "M2Y33C34"
6.3 Combining variables selected from the 6 mobile datasets
Now, the data frames are bound together to form one data frame.
[1] 285374 3
[1] 285374 19
[1] 285374 2
[1] 285374 4
[1] 285374 9
[1] 285374 2
df.hex<-merge( df1s.hex, df2s.hex, by= 'CODE') # BINDING TOGETHER THE BASIS VARIABLE DATA FRAMES
df.hex1<- merge(df.hex, df3s.hex, by='CODE')
df.hex2<- merge(df.hex1, df4s.hex, by='CODE')
df.hex3<- merge(df.hex2, df5s.hex, by='CODE')
df.hex4<- merge(df.hex3, df6s.hex, by='CODE')
rm( "df1s.hex", "df2s.hex", "df3s.hex", "df4s.hex", "df5s.hex", "df6s.hex", "df.hex", "df.hex1", "df.hex2", "df.hex3")
6.4 Renaming and Reordering of Variables
Since the original variable names are not descriptive and the full variable names are long, we’ve adjusted them to provide context in our analyses. The full mapping appears in the code below; one example is “Big Business Confidence” to “Confidence”.
- SV00015 : Big Business Confidence –> Confidence
- SV00028 : Technology Enthusiast –> Tech Enthu.
- ECYPTAMED : Total Population Median Age –> Pop_Age
- ECYPMAMED : Male Population Median Age –> M_Pop_Age
- ECYPFAMED : Female Population Median Age –> F_Pop_Age
- ECYHSZAVG : Average Number Of Persons In Private Households –> **#_Pvt_HHld**
- ECYHTYFHT : Total Family Households –> **#_Fam_Hhld**
- ECYHTYNFH : Non-Family Households –> **#_NF_Hhld**
- ECYCHAFHCH : Average Children Per Census Family Household –> Avg_Chld_PrCen_Fm_Hhld
- ECYSTYHOUS : Houses –> Houses
- ECYSTYAPT : Apartment, Building Low And High Rise –> Apt
- ECYCDOCO : Condos –> Condos
- ECYHNIMED : Median Household Income –> Hhld Income
- ECYEDUHSCE : High School Certificate Or Equivalent –> High School
- ECYEDUCOLL : College, CEGEP Or Other Non-University Certificate Or Diploma –> Clg_CEGEP_Non_Uni_Dip
- ECYEDUUD : University Certificate Or Diploma Below Bachelor –> Uni_Dip_Blw_Bach
- ECYACTEMP : Employed –> Empl
- ECYACTUEMP : Unemployed –> UnEmp
- ECYTIMIMGT : Total Immigrant –> Tot_Mig
- CFM0876I : Annual Premiums - Home Insurance –> Ann_Prem_Hm_Insu
- M221C45 : Mobile Device Marketing Consent –> Mob_Mark_cons
- M222C45 : Mobile Information Sharing Consent –> Mob_Inf_sec_cons
- M3F3C3 : Smartphone Purchase People –> #Smart_purch
- M2X33C31 : Smart Thermostat Owners –> #Smt_Therm
- M2X32C32 : Smart Home Security Owners –> #Smt_Hm_Secu
- M2X33C33 : Smart Home Assistant Owners –> #Smt_Hm_Ass
- M2X33C34 : Smart Home Devices Owners –> Smt_Hm_Dev
- M2Y33C31 : Smart Thermostat Owners- Planning –> #Smat_Therm-Pln
- M2Y32C32 : Smart Home Security Owners-Planning –> #Smt_Hm_Sec-Pln
- M2Y33C33 : Smart Home Assistant Owners-Planning –> #Smt_Hm_Asst-Pln
- M2Y33C34 : Smart Home Devices Owners-Planning –> #Smt_Hm_Dev-Pln
- M3Q241C45 : Financial App Trust –> Fin_App_Trst
The variables were also reordered so that technology-related ones were grouped together and demographic-related ones were grouped together. We can see through the structure of fs1 that all variables have been recoded correctly.
new_names<-c(SV00015="Confidence", SV00028= "Tech Enthu.",ECYPTAMED= "Pop_Age",ECYPMAMED= "M_Pop_Age",ECYPFAMED= "F_Pop_Age",ECYHSZAVG= "#_Pvt_HHld",ECYHTYFHT= "#_Fam_Hhld",ECYHTYNFH= "#_NF_Hhld",ECYCHAFHCH= "Avg_Chld_PrCen_Fm_Hhld",ECYSTYHOUS= "Houses",ECYSTYAPT= "Apt",ECYCDOCO= "Condos",ECYHNIMED="Hhld Income",ECYEDUHSCE="High School",ECYEDUCOLL="Clg_CEGEP_Non_Uni_Dip",ECYEDUUD="Uni_Dip_Blw_Bach",ECYACTEMP="Empl",ECYACTUEMP="UnEmp",ECYTIMIMGT="Tot_Mig",CFM0876I="Ann_Prem_Hm_Insu",M221C45="Mob_Mark_cons",M222C45="Mob_Inf_sec_cons",M3F3C3="# Smart_purch",M2X33C31="#Smt_Therm",M2X32C32="#Smt_Hm_Secu",M2X33C33="#Smt_Hm_Ass",M2X33C34="Smt_Hm_Dev",M2Y33C31="#Smat_Therm-Pln",M2Y32C32="#Smt_Hm_Sec-Pln",M2Y33C33="#Smt_Hm_Asst-Pln",M2Y33C34="#Smt_Hm_Dev-Pln",M3Q241C45="Fin_App_Trst",ECYBASPOP="Total_Pop")
df.hex4<-rename_columns(df.hex4,new_names)
colnames(df.hex4)
[1] "CODE" "Confidence" "Tech Enthu."
[4] "Total_Pop" "Pop_Age" "M_Pop_Age"
[7] "F_Pop_Age" "#_Pvt_HHld" "#_Fam_Hhld"
[10] "#_NF_Hhld" "Avg_Chld_PrCen_Fm_Hhld" "Houses"
[13] "Apt" "Condos" "Hhld Income"
[16] "High School" "Clg_CEGEP_Non_Uni_Dip" "Uni_Dip_Blw_Bach"
[19] "Empl" "UnEmp" "Tot_Mig"
[22] "Ann_Prem_Hm_Insu" "Mob_Mark_cons" "Mob_Inf_sec_cons"
[25] "# Smart_purch" "#Smt_Therm" "#Smt_Hm_Secu"
[28] "#Smt_Hm_Ass" "Smt_Hm_Dev" "#Smat_Therm-Pln"
[31] "#Smt_Hm_Sec-Pln" "#Smt_Hm_Asst-Pln" "#Smt_Hm_Dev-Pln"
[34] "Fin_App_Trst"
Now, we rearrange the column order:
col_order <- c("Confidence","Total_Pop","Tech Enthu.", "Ann_Prem_Hm_Insu", "Mob_Mark_cons","Mob_Inf_sec_cons","# Smart_purch","#Smt_Therm","#Smt_Hm_Secu","#Smt_Hm_Ass","Smt_Hm_Dev","#Smat_Therm-Pln","#Smt_Hm_Sec-Pln","#Smt_Hm_Asst-Pln","#Smt_Hm_Dev-Pln","Fin_App_Trst","CODE","Pop_Age","M_Pop_Age","F_Pop_Age","#_Pvt_HHld","#_Fam_Hhld","#_NF_Hhld","Avg_Chld_PrCen_Fm_Hhld","Houses","Apt","Condos","Hhld Income","High School","Clg_CEGEP_Non_Uni_Dip","Uni_Dip_Blw_Bach","Empl","UnEmp","Tot_Mig")
df.hex4 <- df.hex4[, col_order]
head(df.hex4)
  Confidence Total_Pop Tech Enthu. Ann_Prem_Hm_Insu Mob_Mark_cons
1 2917.6303 11875 1658.5059 2636.2583 2102.76
2 555.0760 3339 475.3049 1025.1245 469.62
3 769.6208 3327 986.1495 746.9732 352.56
4 430.9621 2474 369.0277 611.4645 386.08
5 390.9832 1973 403.9488 528.4261 270.98
6 2109.9184 10258 2077.7702 2450.7502 1112.58
Mob_Inf_sec_cons # Smart_purch #Smt_Therm #Smt_Hm_Secu #Smt_Hm_Ass Smt_Hm_Dev
1 1529.28 1529.28 1911.60 477.90 1816.02 1146.96
2 443.53 417.44 704.43 260.90 730.52 443.53
3 406.80 298.32 488.16 162.72 542.40 244.08
4 365.76 203.20 548.64 142.24 508.00 264.16
5 302.86 270.98 302.86 95.64 462.26 223.16
6 1748.34 1033.11 1907.28 1271.52 1827.81 1350.99
#Smat_Therm-Pln #Smt_Hm_Sec-Pln #Smt_Hm_Asst-Pln #Smt_Hm_Dev-Pln Fin_App_Trst
1 1433.70 764.64 955.80 955.80 4970.16
2 234.81 286.99 208.72 260.90 1330.59
3 271.20 352.56 162.72 216.96 1328.88
4 243.84 223.52 203.20 162.56 995.68
5 175.34 207.22 111.58 127.52 828.88
6 874.17 715.23 476.82 715.23 3894.03
CODE Pop_Age M_Pop_Age F_Pop_Age #_Pvt_HHld #_Fam_Hhld #_NF_Hhld
1 K0A1A0 46.1 44.9 47.7 2.6 3386 1064
2 K0A1B0 46.0 45.5 46.4 2.4 1143 220
3 K0A1E0 43.5 43.3 43.9 3.2 763 247
4 K0A1G0 43.9 43.1 45.1 3.0 626 187
5 K0A1K0 44.5 44.1 45.0 2.3 645 202
6 K0A1L0 40.9 39.9 41.8 2.8 3079 631
Avg_Chld_PrCen_Fm_Hhld Houses Apt Condos Hhld Income High School
1 0.9 3921 516 122 95077.29 2640
2 1.0 1355 5 1 129215.03 753
3 1.0 883 118 15 88144.35 860
4 1.0 772 38 4 105583.46 582
5 0.9 782 63 40 98552.99 490
6 1.1 3491 216 126 146759.85 2055
Clg_CEGEP_Non_Uni_Dip Uni_Dip_Blw_Bach Empl UnEmp Tot_Mig
1 2926 1988 6045 174 851
2 791 651 1817 115 305
3 762 448 1839 98 161
4 580 483 1315 59 129
5 467 337 1036 64 270
6 2156 2904 5805 248 1477
[1] "Confidence" "Total_Pop" "Tech Enthu."
[4] "Ann_Prem_Hm_Insu" "Mob_Mark_cons" "Mob_Inf_sec_cons"
[7] "# Smart_purch" "#Smt_Therm" "#Smt_Hm_Secu"
[10] "#Smt_Hm_Ass" "Smt_Hm_Dev" "#Smat_Therm-Pln"
[13] "#Smt_Hm_Sec-Pln" "#Smt_Hm_Asst-Pln" "#Smt_Hm_Dev-Pln"
[16] "Fin_App_Trst" "CODE" "Pop_Age"
[19] "M_Pop_Age" "F_Pop_Age" "#_Pvt_HHld"
[22] "#_Fam_Hhld" "#_NF_Hhld" "Avg_Chld_PrCen_Fm_Hhld"
[25] "Houses" "Apt" "Condos"
[28] "Hhld Income" "High School" "Clg_CEGEP_Non_Uni_Dip"
[31] "Uni_Dip_Blw_Bach" "Empl" "UnEmp"
[34] "Tot_Mig"
As some of the variables represent totals of households/population in a postal code, we divide those variables by the total population to put each value in a common unit of “per person”.
df.hex4$`Tech Enthu.`=df.hex4$`Tech Enthu.`/df.hex4$Total_Pop
df.hex4$Ann_Prem_Hm_Insu=df.hex4$Ann_Prem_Hm_Insu/df.hex4$Total_Pop
df.hex4$Mob_Mark_cons=df.hex4$Mob_Mark_cons/df.hex4$Total_Pop
df.hex4$Mob_Inf_sec_cons=df.hex4$Mob_Inf_sec_cons/df.hex4$Total_Pop
df.hex4$Smt_Hm_Dev=df.hex4$Smt_Hm_Dev/df.hex4$Total_Pop
df.hex4$Fin_App_Trst=df.hex4$Fin_App_Trst/df.hex4$Total_Pop
df.hex4$Houses=df.hex4$Houses/df.hex4$Total_Pop
df.hex4$Apt=df.hex4$Apt/df.hex4$Total_Pop
df.hex4$Condos=df.hex4$Condos/df.hex4$Total_Pop
df.hex4$`Hhld Income`=df.hex4$`Hhld Income`/df.hex4$Total_Pop
df.hex4$`High School`=df.hex4$`High School`/df.hex4$Total_Pop
df.hex4$Clg_CEGEP_Non_Uni_Dip=df.hex4$Clg_CEGEP_Non_Uni_Dip/df.hex4$Total_Pop
df.hex4$Uni_Dip_Blw_Bach=df.hex4$Uni_Dip_Blw_Bach/df.hex4$Total_Pop
df.hex4$Empl=df.hex4$Empl/df.hex4$Total_Pop
df.hex4$UnEmp=df.hex4$UnEmp/df.hex4$Total_Pop
df.hex4$Tot_Mig=df.hex4$Tot_Mig/df.hex4$Total_Pop
df.hex4$`# Smart_purch`=df.hex4$`# Smart_purch`/df.hex4$Total_Pop
df.hex4$`#Smt_Therm`=df.hex4$`#Smt_Therm`/df.hex4$Total_Pop
df.hex4$`#Smt_Hm_Secu`=df.hex4$`#Smt_Hm_Secu`/df.hex4$Total_Pop
df.hex4$`#Smt_Hm_Ass`=df.hex4$`#Smt_Hm_Ass`/df.hex4$Total_Pop
df.hex4$`#Smat_Therm-Pln`=df.hex4$`#Smat_Therm-Pln`/df.hex4$Total_Pop
df.hex4$`#Smt_Hm_Sec-Pln`=df.hex4$`#Smt_Hm_Sec-Pln`/df.hex4$Total_Pop
df.hex4$`#Smt_Hm_Asst-Pln`=df.hex4$`#Smt_Hm_Asst-Pln`/df.hex4$Total_Pop
df.hex4$`#Smt_Hm_Dev-Pln`=df.hex4$`#Smt_Hm_Dev-Pln`/df.hex4$Total_Pop
df.hex4$Confidence=df.hex4$Confidence/df.hex4$Total_Pop
  Confidence Total_Pop Tech Enthu. Ann_Prem_Hm_Insu Mob_Mark_cons
1 0.2456952 11875 0.1396637 0.2220007 0.1770745
2 0.1662402 3339 0.1423495 0.3070154 0.1406469
3 0.2313258 3327 0.2964080 0.2245186 0.1059693
4 0.1741965 2474 0.1491624 0.2471562 0.1560550
5 0.1981669 1973 0.2047384 0.2678287 0.1373441
6 0.2056852 10258 0.2025512 0.2389111 0.1084597
Mob_Inf_sec_cons # Smart_purch #Smt_Therm #Smt_Hm_Secu #Smt_Hm_Ass Smt_Hm_Dev
1 0.1287815 0.12878147 0.1609768 0.04024421 0.1529280 0.09658611
2 0.1328332 0.12501947 0.2109704 0.07813717 0.2187841 0.13283318
3 0.1222723 0.08966637 0.1467268 0.04890893 0.1630298 0.07336339
4 0.1478416 0.08213420 0.2217623 0.05749394 0.2053355 0.10677445
5 0.1535023 0.13734415 0.1535023 0.04847440 0.2342930 0.11310694
6 0.1704367 0.10071261 0.1859310 0.12395399 0.1781839 0.13170111
#Smat_Therm-Pln #Smt_Hm_Sec-Pln #Smt_Hm_Asst-Pln #Smt_Hm_Dev-Pln Fin_App_Trst
1 0.12073263 0.06439074 0.08048842 0.08048842 0.4185398
2 0.07032345 0.08595088 0.06250973 0.07813717 0.3984996
3 0.08151488 0.10596934 0.04890893 0.06521190 0.3994229
4 0.09856103 0.09034762 0.08213420 0.06570736 0.4024576
5 0.08886974 0.10502788 0.05655347 0.06463254 0.4201115
6 0.08521837 0.06972412 0.04648275 0.06972412 0.3796091
CODE Pop_Age M_Pop_Age F_Pop_Age #_Pvt_HHld #_Fam_Hhld #_NF_Hhld
1 K0A1A0 46.1 44.9 47.7 2.6 3386 1064
2 K0A1B0 46.0 45.5 46.4 2.4 1143 220
3 K0A1E0 43.5 43.3 43.9 3.2 763 247
4 K0A1G0 43.9 43.1 45.1 3.0 626 187
5 K0A1K0 44.5 44.1 45.0 2.3 645 202
6 K0A1L0 40.9 39.9 41.8 2.8 3079 631
Avg_Chld_PrCen_Fm_Hhld Houses Apt Condos Hhld Income
1 0.9 0.3301895 0.043452632 0.0102736842 8.006509
2 1.0 0.4058101 0.001497454 0.0002994909 38.698721
3 1.0 0.2654043 0.035467388 0.0045085663 26.493643
4 1.0 0.3120453 0.015359741 0.0016168149 42.677227
5 0.9 0.3963507 0.031931069 0.0202736949 49.950831
6 1.1 0.3403198 0.021056736 0.0122830961 14.306868
High School Clg_CEGEP_Non_Uni_Dip Uni_Dip_Blw_Bach Empl UnEmp
1 0.2223158 0.2464000 0.1674105 0.5090526 0.01465263
2 0.2255166 0.2368973 0.1949686 0.5441749 0.03444145
3 0.2584911 0.2290352 0.1346558 0.5527502 0.02945597
4 0.2352466 0.2344382 0.1952304 0.5315279 0.02384802
5 0.2483528 0.2366954 0.1708059 0.5250887 0.03243791
6 0.2003314 0.2101774 0.2830961 0.5658998 0.02417625
Tot_Mig
1 0.07166316
2 0.09134471
3 0.04839194
4 0.05214228
5 0.13684744
6 0.14398518
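The column-by-column divisions above could equally be written as a loop. A minimal sketch on a toy data frame, where `per_capita_cols` is an assumed helper vector naming the columns to normalize (not part of the original code):

```r
# Toy frame with a Total_Pop column and two count columns to normalize.
df <- data.frame(Total_Pop  = c(11875, 3339),
                 Confidence = c(2917.6303, 555.0760),
                 Houses     = c(3921, 1355))

# Assumption: every column except Total_Pop is a count to express per person.
per_capita_cols <- setdiff(names(df), "Total_Pop")
for (col in per_capita_cols) {
  df[[col]] <- df[[col]] / df$Total_Pop
}
```

The loop form avoids copy-paste errors (such as forgetting one column) when the variable list changes.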
Label Type Missing Zeros PosInf NegInf Min
1 Confidence real 30400 2229 0 0 0
2 Total_Pop int 0 30400 0 0 0
3 Tech Enthu. real 30400 2229 0 0 0
4 Ann_Prem_Hm_Insu real 30400 2229 0 0 0
5 Mob_Mark_cons real 30400 2248 0 0 0
6 Mob_Inf_sec_cons real 30400 2248 0 0 0
7 # Smart_purch real 30400 2248 0 0 0
8 #Smt_Therm real 30400 2248 0 0 0
9 #Smt_Hm_Secu real 30400 2353 0 0 0
10 #Smt_Hm_Ass real 30400 2248 0 0 0
11 Smt_Hm_Dev real 30400 2248 0 0 0
12 #Smat_Therm-Pln real 30400 2251 0 0 0
13 #Smt_Hm_Sec-Pln real 30400 2270 0 0 0
14 #Smt_Hm_Asst-Pln real 30400 2271 0 0 0
15 #Smt_Hm_Dev-Pln real 30400 2251 0 0 0
16 Fin_App_Trst real 30400 2248 0 0 0
17 CODE string 0 0 0 0 NaN
18 Pop_Age real 0 30400 0 0 0
19 M_Pop_Age real 0 33420 0 0 0
20 F_Pop_Age real 0 49473 0 0 0
21 #_Pvt_HHld real 0 32629 0 0 0
22 #_Fam_Hhld int 0 33957 0 0 0
23 #_NF_Hhld int 0 61796 0 0 0
24 Avg_Chld_PrCen_Fm_Hhld real 0 37183 0 0 0
25 Houses real 30400 7985 0 0 0
26 Apt real 30400 107710 0 0 0
27 Condos real 30400 186082 0 0 0
28 Hhld Income real 30400 2229 0 0 0
29 High School real 30400 6110 0 0 0
30 Clg_CEGEP_Non_Uni_Dip real 30400 10660 0 0 0
31 Uni_Dip_Blw_Bach real 30400 20955 0 0 0
32 Empl real 30400 2404 0 0 0
33 UnEmp real 30400 102846 0 0 0
34 Tot_Mig real 30400 23913 0 0 0
Max Mean Sigma Cardinality
1 0.4615281 0.22364379 0.05600671 NA
2 17176.0000000 51.39055065 184.71751083 NA
3 0.4084117 0.21796779 0.05953486 NA
4 0.8051892 0.21656763 0.07575771 NA
5 0.3200000 0.13348765 0.03520981 NA
6 0.4500000 0.18203466 0.04390049 NA
7 0.4947183 0.14579603 0.05531290 NA
8 0.5075000 0.18677906 0.06775226 NA
9 0.3200000 0.08309306 0.04106577 NA
10 0.4800000 0.20590744 0.05357666 NA
11 0.3794595 0.11621416 0.04169025 NA
12 0.2520000 0.09426797 0.03153806 NA
13 0.3208333 0.11155460 0.04268567 NA
14 0.2818182 0.07922398 0.03198151 NA
15 0.3051163 0.09788840 0.03487279 NA
16 0.7200000 0.43228969 0.07519694 NA
17 NaN NA NA NA
18 90.2000000 38.46463833 16.20522133 NA
19 90.2000000 37.45517321 16.81544034 NA
20 90.2000000 36.64113724 20.61625165 NA
21 15.0000000 2.71663221 1.54614338 NA
22 5659.0000000 13.59636477 53.28873497 NA
23 1450.0000000 5.87630618 22.65970172 NA
24 11.3000000 1.08344383 0.68519524 NA
25 1.0000000 0.27825254 0.12110564 NA
26 1.0000000 0.08330559 0.14144898 NA
27 1.0000000 0.02688702 0.07406307 NA
28 1503523.0100000 6903.10433236 15586.33001756 NA
29 1.0000000 0.23838485 0.08775688 NA
30 1.0000000 0.18352363 0.07958586 NA
31 1.0000000 0.23445186 0.15831378 NA
32 1.0000000 0.51589893 0.15144526 NA
33 1.0000000 0.03369596 0.04612460 NA
34 1.0000000 0.24478782 0.18853522 NA
Writing the combined data frame to a file so that the work above does not need to be repeated.
Note: the final CSV generated here contains a few rows with 0 values in all columns. We deleted those rows manually and included the final “Project_Merged_File” in the zip archive itself.
6.5 Segmentation Analysis
6.5.1 Reading in the Data
We started by reading in the variables from our CSV file.
6.5.2 Missing Values : XRAY
One of the most difficult realities of working with data is missing values. Using the xray package, we confirmed that there were no missing values to deal with.
$variables
Variable q qNA pNA qZero pZero qBlank pBlank qInf pInf
1 Condos 252726 0 - 183834 72.74% 0 - 0 -
2 Apt 252726 0 - 105463 41.73% 0 - 0 -
3 UnEmp 252726 0 - 100598 39.81% 0 - 0 -
4 X._NF_Hhld 252726 0 - 29148 11.53% 0 - 0 -
5 Tot_Mig 252726 0 - 21665 8.57% 0 - 0 -
6 F_Pop_Age 252726 0 - 19015 7.52% 0 - 0 -
7 Uni_Dip_Blw_Bach 252726 0 - 18712 7.4% 0 - 0 -
8 Clg_CEGEP_Non_Uni_Dip 252726 0 - 8413 3.33% 0 - 0 -
9 Houses 252726 0 - 5755 2.28% 0 - 0 -
10 Avg_Chld_PrCen_Fm_Hhld 252726 0 - 4553 1.8% 0 - 0 -
11 High.School 252726 0 - 3863 1.53% 0 - 0 -
12 M_Pop_Age 252726 0 - 2927 1.16% 0 - 0 -
13 X._Fam_Hhld 252726 0 - 1328 0.53% 0 - 0 -
14 Empl 252726 0 - 172 0.07% 0 - 0 -
15 X.Smt_Hm_Secu 252726 0 - 105 0.04% 0 - 0 -
16 X.Smt_Hm_Asst.Pln 252726 0 - 23 0.01% 0 - 0 -
17 X.Smt_Hm_Sec.Pln 252726 0 - 22 0.01% 0 - 0 -
18 X.Smat_Therm.Pln 252726 0 - 3 0% 0 - 0 -
19 X.Smt_Hm_Dev.Pln 252726 0 - 3 0% 0 - 0 -
20 X._Pvt_HHld 252726 0 - 0 - 0 - 0 -
21 Pop_Age 252726 0 - 0 - 0 - 0 -
22 Total_Pop 252726 0 - 0 - 0 - 0 -
23 Confidence 252726 0 - 0 - 0 - 0 -
24 Tech.Enthu. 252726 0 - 0 - 0 - 0 -
25 Ann_Prem_Hm_Insu 252726 0 - 0 - 0 - 0 -
26 Mob_Mark_cons 252726 0 - 0 - 0 - 0 -
27 Mob_Inf_sec_cons 252726 0 - 0 - 0 - 0 -
28 Smt_Hm_Dev 252726 0 - 0 - 0 - 0 -
29 Fin_App_Trst 252726 0 - 0 - 0 - 0 -
30 X.Smt_Hm_Ass 252726 0 - 0 - 0 - 0 -
31 X..Smart_purch 252726 0 - 0 - 0 - 0 -
32 X.Smt_Therm 252726 0 - 0 - 0 - 0 -
33 Hhld.Income 252726 0 - 0 - 0 - 0 -
34 CODE 252726 0 - 0 - 0 - 0 -
qDistinct type anomalous_percent
1 7567 Numeric 72.74%
2 10390 Numeric 41.73%
3 5136 Numeric 39.81%
4 479 Integer 11.53%
5 11811 Numeric 8.57%
6 723 Numeric 7.52%
7 10293 Numeric 7.4%
8 7046 Numeric 3.33%
9 8719 Numeric 2.28%
10 48 Numeric 1.8%
11 7634 Numeric 1.53%
12 668 Numeric 1.16%
13 804 Integer 0.53%
14 10525 Numeric 0.07%
15 48765 Numeric 0.04%
16 44903 Numeric 0.01%
17 48552 Numeric 0.01%
18 44468 Numeric 0%
19 45573 Numeric 0%
20 104 Numeric -
21 670 Numeric -
22 1608 Integer -
23 39750 Numeric -
24 39750 Numeric -
25 43899 Numeric -
26 44181 Numeric -
27 45403 Numeric -
28 48251 Numeric -
29 48628 Numeric -
30 49279 Numeric -
31 49827 Numeric -
32 54767 Numeric -
33 140106 Numeric -
34 252726 Character -
$problem_variables
[1] Variable q qNA pNA
[5] qZero pZero qBlank pBlank
[9] qInf pInf qDistinct type
[13] anomalous_percent problems
<0 rows> (or 0-length row.names)
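A quick base-R cross-check of the same question, counting NAs and blanks per column, can be sketched on a toy frame (illustrative only; the notebook relies on xray for the full report above):

```r
df <- data.frame(a = c(1, 2, NA),
                 b = c("x", "", "y"),
                 stringsAsFactors = FALSE)

colSums(is.na(df))             # NA count per variable
sum(df$b == "", na.rm = TRUE)  # blank count for a character column
```

This mirrors xray's qNA and qBlank columns: NAs and empty strings are counted separately, since both can hide real gaps in the data.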
6.6 Level 1 Segmentation
For the segmentation analyses, K-means was used.
Level 1 segmentation used the attitude variables, which are predominantly technology focused. There was also one variable about insurance premiums and another about confidence in big businesses.
The dataset, now imported into the h2o cluster, still has all variables correctly recoded.
Class 'H2OFrame' <environment: 0x000000002910d028>
- attr(*, "op")= chr "Parse"
- attr(*, "id")= chr "fs.class_sid_abe7_21"
- attr(*, "eval")= logi FALSE
- attr(*, "nrow")= int 252726
- attr(*, "ncol")= int 33
- attr(*, "types")=List of 33
..$ : chr "real"
..$ : chr "real"
..$ : chr "real"
..$ : chr "real"
..$ : chr "real"
..$ : chr "real"
..$ : chr "real"
..$ : chr "real"
..$ : chr "real"
..$ : chr "real"
..$ : chr "real"
..$ : chr "real"
..$ : chr "real"
..$ : chr "real"
..$ : chr "real"
..$ : chr "string"
..$ : chr "real"
..$ : chr "real"
..$ : chr "real"
..$ : chr "real"
..$ : chr "int"
..$ : chr "int"
..$ : chr "real"
..$ : chr "real"
..$ : chr "real"
..$ : chr "real"
..$ : chr "real"
..$ : chr "real"
..$ : chr "real"
..$ : chr "real"
..$ : chr "real"
..$ : chr "real"
..$ : chr "real"
- attr(*, "data")='data.frame': 10 obs. of 33 variables:
..$ Confidence : num 0.00943 0.02595 0.02524 0.03188 0.03043 ...
..$ Tech.Enthu. : num 0.00974 0.01475 0.02987 0.02482 0.02848 ...
..$ Ann_Prem_Hm_Insu : num 0.0044 0.00587 0.02074 0.0181 0.00905 ...
..$ Mob_Mark_cons : num 0.00565 0.01267 0.01556 0.01176 0.01877 ...
..$ Mob_Inf_sec_cons : num 0.00753 0.02139 0.02556 0.02706 0.02765 ...
..$ X..Smart_purch : num 0.00715 0.01426 0.01889 0.02118 0.02469 ...
..$ X.Smt_Therm : num 0.00979 0.02218 0.02889 0.02824 0.0237 ...
..$ X.Smt_Hm_Secu : num 0.00452 0.00713 0.01222 0.00941 0.01185 ...
..$ X.Smt_Hm_Ass : num 0.012 0.0253 0.0322 0.0353 0.0267 ...
..$ Smt_Hm_Dev : num 0.00715 0.01188 0.01889 0.02 0.01877 ...
..$ X.Smat_Therm.Pln : num 0.00452 0.01109 0.01111 0.01647 0.01778 ...
..$ X.Smt_Hm_Sec.Pln : num 0.00791 0.01267 0.01889 0.01412 0.02173 ...
..$ X.Smt_Hm_Asst.Pln : num 0.00264 0.00554 0.01111 0.01529 0.01284 ...
..$ X.Smt_Hm_Dev.Pln : num 0.00414 0.01109 0.00889 0.01294 0.0158 ...
..$ Fin_App_Trst : num 0.0188 0.0451 0.0567 0.0659 0.0583 ...
..$ CODE : chr "K7L5R2" "K9V6H4" "N2L2G4" "N2L2J2" ...
..$ Pop_Age : num 41.8 32.9 84.3 85 83.8 76.8 81.2 71.8 37.5 77
..$ M_Pop_Age : num 42.1 32.3 83.6 86.1 76.9 74 82.6 70.9 35 42.2
..$ F_Pop_Age : num 15 37.5 85 83.5 86 78.1 80 72.7 67.5 80.5
..$ X._Pvt_HHld : num 6 11 7 5 14 3.6 2 7 2 9
..$ X._Fam_Hhld : num 2 1 1 1 1 5 1 1 1 1
..$ X._NF_Hhld : num 1 0 0 0 0 9 0 0 0 0
..$ Avg_Chld_PrCen_Fm_Hhld: num 3 3 2 1 3 2.4 1 1 1 2
..$ Houses : num 0.00235 0.0099 0.02778 0.02941 0.01235 ...
..$ Apt : num 0.00235 0 0 0 0 ...
..$ Condos : num 0.00235 0 0 0 0 ...
..$ Hhld.Income : num 128 906 2407 1528 1765 ...
..$ High.School : num 0.00706 0.0396 0 0.02941 0.04938 ...
..$ Clg_CEGEP_Non_Uni_Dip : num 0.00941 0.0198 0 0.02941 0.01235 ...
..$ Uni_Dip_Blw_Bach : num 0.0141 0 0.0833 0.0294 0.0123 ...
..$ Empl : num 0.0118 0.0495 0.0556 0.0882 0.0864 ...
..$ UnEmp : num 0.00471 0 0.02778 0.02941 0 ...
..$ Tot_Mig : num 0.0235 0.0099 0.0278 0.0294 0.0741 ...
6.6.1 A 2 Segment Solution
Level 1 segmentation begins with a two-cluster solution and then moves ahead to 3 segments and so on until useful options are identified. Two segments is just a first guess.
The dfK variable below holds the sums of squares for each segmentation solution. One key objective of segmentation is to maximize the differences among segments while keeping individuals within each segment very close together. That translates into maximizing the between-segments sum-of-squares and minimizing the within-segments sum-of-squares.
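As a minimal illustration of this objective (using base R's stats::kmeans on toy data, not the h2o model used in this project), the total sum-of-squares always decomposes into exactly these two pieces:

```r
# Toy illustration: total SS = within SS + between SS for any k-means solution.
# Uses stats::kmeans, not h2o.kmeans; the decomposition is the same idea.
set.seed(7238)
toy <- data.frame(x = c(rnorm(50, 0), rnorm(50, 4)),
                  y = c(rnorm(50, 0), rnorm(50, 4)))
km  <- kmeans(toy, centers = 2, nstart = 10)

km$totss                       # total sums of squares (constant for the data)
km$tot.withinss                # what we want small
km$betweenss                   # what we want large
all.equal(km$totss, km$tot.withinss + km$betweenss)  # TRUE: they always add up
```

Because the total is fixed, driving the within-SS down necessarily drives the between-SS up, which is why the comparison table later tracks both.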
dfK <- data.frame(Numbr_Segments=numeric(),
TotWithinSS=numeric(),
BetweenSS=numeric(),
TotSS=numeric(),
stringsAsFactors=FALSE)
This code-chunk conducts the k-means analysis in h2o on all attitude variables. We wanted to see whether these variables could partition the dataset into clusters that are distinct from one another.
k_iterations = 2
fs.km2 <- h2o.kmeans(fs.class[, 1:15], k = k_iterations, estimate_k = TRUE,
init=c("PlusPlus"), standardize= FALSE,
score_each_iteration= FALSE, seed = 7238,
keep_cross_validation_predictions=TRUE)
The following code-chunk saves key information from this solution in the dfK receptacle so that a comparative table can be produced when all rounds of partitioning are finished. The dfK object holds the number of segments, the 'within sum of squares' (which we are looking to minimize), the 'between sum of squares' (which we are looking to maximize), and the 'total sum of squares' (which will always be the same).
dfK[k_iterations-1,1] <- k_iterations
dfK[k_iterations-1,2] <- getTotWithinSS(fs.km2) # within segment sums of squares
dfK[k_iterations-1,3] <- getBetweenSS(fs.km2) # between segments sums of squares
dfK[k_iterations-1,4] <- getTotSS(fs.km2)       # total sums of squares
Out of this two-segment k-means solution, there are 96,002 observations in one cluster and 156,724 in the second. They are uneven, but that is nothing to be concerned about; with additional segments, they will hopefully even out.
[1] 96002 156724
The following presents only cluster means, with centroid numbering for the variables. For example, the mean for variable “confidence” for segment 1 is 0.252506 and for segment 2 is 0.209111.
Cluster Means:
centroid confidence techenthu ann_prem_hm_insu mob_mark_cons mob_inf_sec_cons
1 1 0.252506 0.251570 0.189596 0.147918 0.196269
2 2 0.209111 0.200449 0.236151 0.126548 0.175912
xsmart_purch xsmt_therm xsmt_hm_secu xsmt_hm_ass smt_hm_dev xsmat_thermpln
1 0.184659 0.243259 0.117882 0.247712 0.150620 0.110301
2 0.124040 0.154800 0.062937 0.183209 0.096768 0.085782
xsmt_hm_secpln xsmt_hm_asstpln xsmt_hm_devpln fin_app_trst
1 0.145282 0.098794 0.120587 0.448208
2 0.092458 0.068352 0.085364 0.428726
Using ggplot, a plot of the centroids for the 2 segments over all used variables was created. You'll notice that there is no 'layering' between the two segments in the centroid means plot; there is some crossover, which helps with describing persona profiles. We know that further segmentation will yield better results, so we have stopped investigating the 2-segment solution and moved on to the 3-segment clustering below this plot.
library(reshape2) # it's necessary to reshape the data into 'long' format
fs2c_long<- melt(fs.km2@model$centers ) # need to reshape to 'long' form
library(ggplot2)
ggplot(data=fs2c_long, aes(x=variable, y=value, group=centroid)) +
geom_line(aes( color= centroid ), size=1.2)+
geom_point(aes( color= centroid ), size=3) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size=14)) +
labs( title= "Centroids (means) of Segments over Basis Variables",
x= "Basis Variables", y= "Means (0-0.45)") +
ylim(0, 0.45) +
annotate("text", x = 3, y = 1, label = "Base = 1036", size=4)
6.6.2 A 3 segment solution
This code-chunk simply changes k in the k-means command to 3, i.e., 3 segments. fs.km3 holds the 3-segment solution.
k_iterations = 3
fs.km3 <- h2o.kmeans( fs.class[,1:15], k=k_iterations, estimate_k = TRUE,
init=c("PlusPlus"), standardize= FALSE,
score_each_iteration= FALSE, seed = 7238,
keep_cross_validation_predictions=TRUE )
The sizes of the three segmentation clusters are below. The sizes are now much more even: 81,147, 81,866, and 89,713.
dfK[k_iterations-1,1] <- k_iterations
dfK[k_iterations-1,2] <- getTotWithinSS(fs.km3) # within segment sums of squares
dfK[k_iterations-1,3] <- getBetweenSS(fs.km3) # between segments sums of squares
dfK[k_iterations-1,4] <- getTotSS(fs.km3)       # total sums of squares
The sizes of the segmentation clusters and the cluster means, with centroid numbering, are below.
[1] 81147 81866 89713
Cluster Means:
centroid confidence techenthu ann_prem_hm_insu mob_mark_cons mob_inf_sec_cons
1 1 0.256710 0.253889 0.179860 0.148702 0.195111
2 2 0.199922 0.196589 0.189364 0.122409 0.167667
3 3 0.220600 0.210086 0.279686 0.133024 0.187692
xsmart_purch xsmt_therm xsmt_hm_secu xsmt_hm_ass smt_hm_dev xsmat_thermpln
1 0.188097 0.250251 0.123119 0.250546 0.153713 0.110484
2 0.122583 0.143535 0.060426 0.170710 0.089837 0.079019
3 0.132027 0.172913 0.069328 0.202337 0.108916 0.095679
xsmt_hm_secpln xsmt_hm_asstpln xsmt_hm_devpln fin_app_trst
1 0.149757 0.101535 0.123214 0.441721
2 0.086693 0.065728 0.078356 0.396296
3 0.102135 0.073150 0.094995 0.467023
The following plot of the centroids for the 3 segments is slightly more interesting than the 2-segment solution, but we still need to push further. Some initial observations:
- Segment 1 is the technophiles. They own a lot of smart technology and also plan on purchasing more. This segment is high on enthusiasm for technology and trusting of big businesses.
- Segment 2 is the least technologically-enabled of the three segments. They do, however, pay higher premiums for home insurance than Segment 1, the technophiles.
- Segment 3 is interesting because they pay the highest home insurance premiums by a long shot but don't own a lot of smart home technology. It will be interesting to see the level 2 segmentation, because these individuals may be older and have more money, and thus larger, more expensive houses with higher premiums; being older, they may also be less technologically inclined.
Below the plot, we have decided to push further with a 4-Segment Solution.
fs3c_long<- melt(fs.km3@model$centers ) # need to reshape to 'long' form
ggplot(data=fs3c_long, aes(x=variable, y=value, group=centroid)) +
geom_line(aes( color= centroid ), size=1.2)+
geom_point(aes( color= centroid ), size=3) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size=14)) +
labs( title= "Centroids (means) of Segments over Basis Variables",
x= "Basis Variables", y= "Means (0-0.45)") + ylim(0,0.45) +
annotate("text", x = 3, y = 1, label = "Base = 1036", size=4)
6.6.3 A 4 segment solution
As usual, the clustering algorithm groups the respondents.
#____k-means clustering, k=4 clusters
k_iterations <- 4 # MAXIMUM NUMBER OF SEGMENTS USED HERE
fs.km4 <- h2o.kmeans(fs.class[,1:15], k=k_iterations, estimate_k = FALSE,
init=c("PlusPlus"), standardize= FALSE,
score_each_iteration= FALSE, seed = 7238,
keep_cross_validation_predictions=TRUE)
Saving the sums-of-squares to the dfK data frame.
dfK[k_iterations-1,1] <- k_iterations
dfK[k_iterations-1,2] <- getTotWithinSS(fs.km4) # within segment sums of squares
dfK[k_iterations-1,3] <- getBetweenSS(fs.km4) # between segments sums of squares
dfK[k_iterations-1,4] <- getTotSS(fs.km4)       # total sums of squares
The sizes of the four clusters are getting a little uneven again, but it is nothing to be concerned about: 41,569, 76,774, 62,595, and 71,788.
[1] 41569 76774 62595 71788
The cluster means, with centroid numbering, are provided below.
Cluster Means:
centroid confidence techenthu ann_prem_hm_insu mob_mark_cons mob_inf_sec_cons
1 1 0.201426 0.258592 0.161547 0.144135 0.208495
2 2 0.204561 0.181748 0.218850 0.117299 0.158659
3 3 0.267740 0.249957 0.179796 0.149719 0.190807
4 4 0.225678 0.212453 0.286128 0.134919 0.190283
xsmart_purch xsmt_therm xsmt_hm_secu xsmt_hm_ass smt_hm_dev xsmat_thermpln
1 0.190667 0.147765 0.075679 0.193768 0.105557 0.093433
2 0.107581 0.150261 0.058678 0.171162 0.089400 0.078181
3 0.184109 0.265848 0.131476 0.258052 0.160900 0.111391
4 0.132095 0.185768 0.074036 0.211663 0.116090 0.100329
xsmt_hm_secpln xsmt_hm_asstpln xsmt_hm_devpln fin_app_trst
1 0.098799 0.080699 0.101414 0.462595
2 0.086462 0.062731 0.075302 0.392077
3 0.159152 0.105840 0.127678 0.431884
4 0.108018 0.075420 0.097391 0.473106
This plot of the centroids for the 4 segments shows much greater complexity than did the earlier solutions. Some observations:
- Segment 1 is the second most important group for RBC to target, based on their enthusiasm for technology, smart home technology adoption, and plans to purchase more smart home technology. They are not extreme technophiles like Segment 3, but are the second most technologically-enabled segment.
- Segment 2 is the lowest-priority segment to market to, given the context of RBCI's goal of smart technology partnership and marketing. They have the lowest adoption of smart home technology and the weakest plans to purchase the technology in the future. This segment could in fact encompass an elderly population, but this would need to be confirmed in the level 2 segmentation.
- Segment 3 is now the technophile group, with the most confidence in big businesses and the highest smart home technology ownership and purchase plans.
- Segment 4 is now the group that pays high insurance premiums but is slow to adopt smart home technology.
Before comparing the sums-of-squares, we looked into a 5-segment solution.
library(reshape2) # it's necessary to reshape the data into 'long' format
fs4c_long<- melt(fs.km4@model$centers ) # need to reshape to 'long' form
library(ggplot2)
ggplot(data=fs4c_long, aes(x=variable, y=value, group=centroid)) +
geom_line(aes( color= centroid ), size=1.2)+
geom_point(aes( color= centroid ), size=3) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size=14)) +
labs( title= "Centroids (means) of Segments over Basis Variables",
x= "Basis Variables", y= "Means (0-0.5)") + ylim(0, 0.5) +
annotate("text", x = 3, y = 1, label = "Base = 1036", size=4)
A 5 segment solution
As usual, the clustering algorithm groups the respondents.
k_iterations = 5
fs.km5<- h2o.kmeans(fs.class[,1:15], k=k_iterations, estimate_k = FALSE,
init=c("PlusPlus"), standardize= FALSE,
score_each_iteration= FALSE, seed = 7238,
keep_cross_validation_predictions=TRUE)
Saving the sums-of-squares to the dfK data frame.
dfK[k_iterations-1,1] <- k_iterations
dfK[k_iterations-1,2] <- getTotWithinSS(fs.km5) # within segment sums of squares
dfK[k_iterations-1,3] <- getBetweenSS(fs.km5) # between segments sums of squares
dfK[k_iterations-1,4] <- getTotSS(fs.km5)       # total sums of squares
The sizes of the five clusters are getting a little uneven again, but it is nothing to be concerned about: 28,435, 55,811, 37,787, 68,582, and 62,111.
[1] 28435 55811 37787 68582 62111
The Cluster means with centroid numbering are below.
Cluster Means:
centroid confidence techenthu ann_prem_hm_insu mob_mark_cons mob_inf_sec_cons
1 1 0.196001 0.256632 0.159647 0.149405 0.217220
2 2 0.226635 0.207035 0.302774 0.133919 0.189772
3 3 0.286595 0.254967 0.169591 0.157873 0.193669
4 4 0.202890 0.180364 0.214557 0.116860 0.157771
5 5 0.227482 0.237932 0.202917 0.134659 0.185620
xsmart_purch xsmt_therm xsmt_hm_secu xsmt_hm_ass smt_hm_dev xsmat_thermpln
1 0.200842 0.125183 0.067929 0.185842 0.099148 0.089929
2 0.123966 0.171386 0.066838 0.202734 0.106343 0.097778
3 0.196887 0.279565 0.146351 0.271448 0.169732 0.117141
4 0.106023 0.146765 0.057151 0.168893 0.087195 0.077218
5 0.159467 0.225617 0.099206 0.227997 0.137927 0.102015
xsmt_hm_secpln xsmt_hm_asstpln xsmt_hm_devpln fin_app_trst
1 0.091335 0.080756 0.103926 0.476561
2 0.103691 0.074386 0.097439 0.476269
3 0.174145 0.117362 0.141742 0.432556
4 0.084650 0.061744 0.074103 0.389897
5 0.124923 0.082604 0.099543 0.435099
This plot of the centroids for the 5 segments shows much greater complexity than did the earlier solutions. Perhaps too much complexity. The segments are getting harder to differentiate because there is so much overlap.
library(reshape2) # it's necessary to reshape the data into 'long' format
fs5c_long<- melt(fs.km5@model$centers ) # need to reshape to 'long' form
library(ggplot2)
ggplot(data=fs5c_long, aes(x=variable, y=value, group=centroid)) +
geom_line(aes( color= centroid ), size=1.2)+
geom_point(aes( color= centroid ), size=3) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size=14)) +
labs( title= "Centroids (means) of Segments over Basis Variables",
x= "Basis Variables", y= "Means (0-0.6)") +
ylim(0,0.6) +
annotate("text", x = 3, y = 1, label = "Base = 1036", size=4)
Now let's look at a comparison of the sums-of-squares information to help determine the number of segments to investigate further. The 5-segment solution has the highest between-segment sums-of-squares.
pacman::p_load(sjPlot)
tab_df(dfK, sort.column = -3, show.rownames = FALSE,
title = "Sums of Squares for Level 1 Segmentation", digits=2)
| Numbr_Segments | TotWithinSS | BetweenSS | TotSS |
|---|---|---|---|
| 5 | 5433.62 | 3605.08 | 9038.71 |
| 4 | 5756.99 | 3281.72 | 9038.71 |
| 3 | 6258.32 | 2780.38 | 9038.71 |
| 2 | 6950.98 | 2087.72 | 9038.71 |
Before settling on a solution from this comparison, we also looked into a 6-segment solution.
6.6.4 A 6 segment solution
As usual, the clustering algorithm groups the respondents.
k_iterations <- 6
fs.km6<- h2o.kmeans(fs.class[,1:15], k=k_iterations, estimate_k = FALSE,
init=c("PlusPlus"), standardize= FALSE,
score_each_iteration= FALSE, seed = 7238,
keep_cross_validation_predictions=TRUE)
Saving the sums-of-squares to the dfK data frame.
dfK[k_iterations-1,1] <- k_iterations
dfK[k_iterations-1,2] <- getTotWithinSS(fs.km6) # within segment sums of squares
dfK[k_iterations-1,3] <- getBetweenSS(fs.km6) # between segments sums of squares
dfK[k_iterations-1,4] <- getTotSS(fs.km6)       # total sums of squares
The sizes of the six clusters are getting a little uneven again, but it is nothing to be concerned about: 26,635, 58,409, 34,745, 61,340, 42,733, and 28,864.
[1] 26635 58409 34745 61340 42733 28864
The cluster means, with centroid numbering, are provided below.
Cluster Means:
centroid confidence techenthu ann_prem_hm_insu mob_mark_cons mob_inf_sec_cons
1 1 0.191754 0.256654 0.160416 0.149149 0.217835
2 2 0.213186 0.208622 0.246539 0.132280 0.180940
3 3 0.281919 0.250519 0.176070 0.159439 0.196190
4 4 0.202426 0.178207 0.214629 0.115948 0.156440
5 5 0.245322 0.254933 0.179780 0.134451 0.183664
6 6 0.236315 0.207693 0.338468 0.137530 0.201611
xsmart_purch xsmt_therm xsmt_hm_secu xsmt_hm_ass smt_hm_dev xsmat_thermpln
1 0.204414 0.124788 0.067997 0.186497 0.099902 0.088469
2 0.131895 0.195959 0.078413 0.214738 0.121044 0.098349
3 0.199055 0.287712 0.147283 0.277895 0.174777 0.116015
4 0.103886 0.141856 0.055007 0.165026 0.084133 0.075502
5 0.164899 0.224125 0.105452 0.223130 0.135033 0.104734
6 0.127743 0.160589 0.062986 0.198947 0.101528 0.097754
xsmt_hm_secpln xsmt_hm_asstpln xsmt_hm_devpln fin_app_trst
1 0.089631 0.079154 0.103678 0.479893
2 0.105739 0.071653 0.090768 0.454770
3 0.173174 0.114121 0.141037 0.444190
4 0.082964 0.061023 0.073132 0.387059
5 0.135375 0.093152 0.107180 0.413531
6 0.104794 0.077433 0.102914 0.490498
This plot of the centroids for the 6 segments shows much greater complexity than did the earlier solutions. Perhaps too much complexity. The segments are getting harder to differentiate because there is so much overlap.
fs6c_long<- melt(fs.km6@model$centers ) # need to reshape to 'long' form
ggplot(data=fs6c_long, aes(x=variable, y=value, group=centroid)) +
geom_line(aes( color= centroid ), size=1.2)+
geom_point(aes( color= centroid ), size=3) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size=14)) +
labs( title= "Centroids (means) of Segments over Basis Variables",
x= "Basis Variables", y= "Means (0-0.55)") + ylim(0, 0.55) +
annotate("text", x = 3, y = 1, label = "Base = 1036", size=4)
Having run the 2-segment to 6-segment solutions, we plotted their sums-of-squares information to help determine the number of segments to investigate further. With the addition of each segment the WithinSS decreases and the BetweenSS increases, which is what we want to happen. The 6-segment solution has the lowest WithinSS and the highest BetweenSS; however, the rate of change for both drastically declined at the 6-segment solution. The 6-segment solution's plot of centroid means was also difficult to create differentiated personas out of: the lines are not well separated from each other, so the segments are not that different. The 4-segment plot seemed to be the best from an interpretation standpoint, and it was therefore further investigated.
tab_df(dfK, sort.column = -3, show.rownames = FALSE,
       title = "Sums of Squares for Level 1 Segmentation", digits=2)
| Numbr_Segments | TotWithinSS | BetweenSS | TotSS |
|---|---|---|---|
| 6 | 5274.12 | 3764.59 | 9038.71 |
| 5 | 5433.62 | 3605.08 | 9038.71 |
| 4 | 5756.99 | 3281.72 | 9038.71 |
| 3 | 6258.32 | 2780.38 | 9038.71 |
| 2 | 6950.98 | 2087.72 | 9038.71 |
The graph below may provide a very rough guide for the number of segments, but it is not definitive. The best number of segments may be where there is an elbow in the curve. An elbow seems to be placed at the ______ segment solution, with diminishing returns after it, but the 5-segment solution still increases the distances between segments and reduces the distances within segments. We know from the ggplot above that the 3-segment solution is not sufficient, which is why we stress that the 'elbow' guideline is simply a guideline.
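Before that graph, the same elbow logic can be sketched in miniature with base R's stats::kmeans on toy data (the real analysis above uses h2o): loop over k, collect the within sums-of-squares, and look for where the gains flatten.

```r
# Sketch of the elbow heuristic with base R kmeans (illustrative toy data only)
set.seed(7238)
toy <- matrix(rnorm(600), ncol = 3)

wss <- sapply(2:6, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)
names(wss) <- 2:6
wss                                  # within-SS drops as k grows
diff(wss)                            # size of each successive improvement;
                                     # the elbow is where these gains flatten
```

Plotting wss against k (as the chunk below does for dfK) makes the bend easier to see than the raw numbers.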
library(plotly)
dfK_melt <- melt(dfK, id=c("Numbr_Segments"))
p <- ggplot(data = dfK_melt, aes(x = Numbr_Segments, y = value)) +
geom_point(aes(), size = 4) +
geom_line(aes(colour = variable ), size=2) + facet_wrap(~ variable)+
labs(x="Numbers of segments", y="Sums of squares") +
theme(axis.title.x=element_text(size=16, face="bold", colour="blue")) +
theme(axis.title.y=element_text(size=16, face="bold", colour="blue")) +
theme(axis.text.x=element_text(size=14, face="bold" )) +
theme(axis.text.y=element_text(size=14, face="bold" )) +
ggtitle("Cluster statistics") + theme(legend.position="none")
(gg <- ggplotly(p))
6.6.5 Building Predictive Models That Predict Segment Accuracy
We used Random Forest, GBM, and logistic regression models to predict the segment memberships obtained using K-means.
6.6.5.1 Segment (cluster) assignments.
These segment assignments are obtained by predicting each respondent’s segment based on fs.km4 results. Below are the predicted segment assignments for the 4-segment solution.
clusters.hex <- h2o.predict(fs.km4, fs.class)
clusters.hex <- as.factor(clusters.hex)  # cluster 'names' must be factors for modeling
This table shows the segment sizes printed earlier in a slightly different manner.
predict Count
1 0 41569
2 1 76774
3 2 62595
4 3 71788
[4 rows x 2 columns]
Levels need to be set to "Seg1", "Seg2", "Seg3", "Seg4" instead of 0, 1, 2, 3.
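That recoding is a one-liner on a base-R factor; in the h2o workflow, h2o.setLevels plays the analogous role. A minimal base-R sketch:

```r
# Recode numeric cluster ids 0-3 into readable segment labels (base-R sketch;
# the project's clusters.hex is an h2o frame, relabeled the h2o way)
pred <- factor(c(0, 1, 2, 3, 1, 1),
               levels = c(0, 1, 2, 3),
               labels = c("Seg1", "Seg2", "Seg3", "Seg4"))
pred
table(pred)   # counts per segment label
```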
predict
1 Seg2
2 Seg2
3 Seg2
4 Seg2
5 Seg2
6 Seg2
[252726 rows x 1 column]
#__Bind RID, predicted segments, and fs variables
fs4C.class<- h2o.cbind(fs.class[,c(1)], clusters.hex$predict, fs.class[, c(2:33)])
fs4C.class <- fs4C.class[, c(1:34)]  # reorganize the data frame
A table of assignments for all respondents follows.
predict Count
1 Seg1 41569
2 Seg2 76774
3 Seg3 62595
4 Seg4 71788
[4 rows x 2 columns]
These segment assignments and other file data have been saved in the following code-chunk.
6.6.5.2 Splitting the sample for cross validation
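The splitting code is not echoed here, but the printed row counts below (176,857 training and 75,869 holdout out of 252,726) imply roughly a 70/30 split. A base-R sketch of such a split (the project itself works with h2o frames, presumably via h2o.splitFrame):

```r
# Roughly-70/30 random split, base-R sketch (h2o.splitFrame is the h2o analogue)
set.seed(7238)
n        <- 252726                       # total rows in this analysis
in_train <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.7, 0.3))

sum(in_train)        # ~70% of rows -> training sample
sum(!in_train)       # ~30% of rows -> holdout/validation sample
```

The holdout rows never touch model fitting, which is what makes the validation metrics later trustworthy.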
Confidence predict Tech.Enthu. Ann_Prem_Hm_Insu Mob_Mark_cons
1 0.009426962 Seg2 0.009739573 0.004403856 0.005647059
2 0.025949382 Seg2 0.014750739 0.005865521 0.012673267
3 0.025238024 Seg2 0.029873820 0.020744546 0.015555556
4 0.030433902 Seg2 0.028479670 0.009048419 0.018765432
5 0.056125319 Seg2 0.061331941 0.063795120 0.030182927
6 0.076100931 Seg2 0.039830300 0.092729259 0.022857143
Mob_Inf_sec_cons X..Smart_purch X.Smt_Therm X.Smt_Hm_Secu X.Smt_Hm_Ass
1 0.007529412 0.007152941 0.009788235 0.004517647 0.01204706
2 0.021386139 0.014257426 0.022178218 0.007128713 0.02534653
3 0.025555556 0.018888889 0.028888889 0.012222222 0.03222222
4 0.027654321 0.024691358 0.023703704 0.011851852 0.02666667
5 0.028170732 0.024146341 0.042256098 0.012073171 0.03823171
6 0.028571429 0.012857143 0.018571429 0.005714286 0.02000000
Smt_Hm_Dev X.Smat_Therm.Pln X.Smt_Hm_Sec.Pln X.Smt_Hm_Asst.Pln
1 0.007152941 0.004517647 0.007905882 0.002635294
2 0.011881188 0.011089109 0.012673267 0.005544554
3 0.018888889 0.011111111 0.018888889 0.011111111
4 0.018765432 0.017777778 0.021728395 0.012839506
5 0.024146341 0.032195122 0.028170732 0.034207317
6 0.012857143 0.011428571 0.011428571 0.012857143
X.Smt_Hm_Dev.Pln Fin_App_Trst CODE Pop_Age M_Pop_Age F_Pop_Age X._Pvt_HHld
1 0.004141176 0.01882353 K7L5R2 41.8 42.1 15.0 6.0
2 0.011089109 0.04514852 K9V6H4 32.9 32.3 37.5 11.0
3 0.008888889 0.05666667 N2L2G4 84.3 83.6 85.0 7.0
4 0.015802469 0.05827160 M6B2P9 83.8 76.9 86.0 14.0
5 0.030182927 0.09054878 N7L5H5 76.8 74.0 78.1 3.6
6 0.010000000 0.07571429 K6H6X7 81.2 82.6 80.0 2.0
X._Fam_Hhld X._NF_Hhld Avg_Chld_PrCen_Fm_Hhld Houses Apt
1 2 1 3.0 0.002352941 0.002352941
2 1 0 3.0 0.009900990 0.000000000
3 1 0 2.0 0.027777778 0.000000000
4 1 0 3.0 0.012345679 0.000000000
5 5 9 2.4 0.085365854 0.000000000
6 1 0 1.0 0.000000000 0.142857143
Condos Hhld.Income High.School Clg_CEGEP_Non_Uni_Dip Uni_Dip_Blw_Bach
1 0.002352941 127.5026 0.007058824 0.009411765 0.01411765
2 0.000000000 905.8534 0.039603960 0.019801980 0.00000000
3 0.000000000 2407.2750 0.000000000 0.000000000 0.08333333
4 0.000000000 1764.7486 0.049382716 0.012345679 0.01234568
5 0.048780488 336.3186 0.091463415 0.000000000 0.07926829
6 0.000000000 9959.4643 0.142857143 0.000000000 0.00000000
Empl UnEmp Tot_Mig
1 0.01176471 0.004705882 0.02352941
2 0.04950495 0.000000000 0.00990099
3 0.05555556 0.027777778 0.02777778
4 0.08641975 0.000000000 0.07407407
5 0.10365854 0.012195122 0.30487805
6 0.14285714 0.000000000 0.00000000
[176857 rows x 34 columns]
The testing or holdout or validation sample also has a column indicating the assigned segment for each individual.
Confidence predict Tech.Enthu. Ann_Prem_Hm_Insu Mob_Mark_cons
1 0.03188337 Seg2 0.02481761 0.01809865 0.01176471
2 0.05467271 Seg2 0.05312974 0.06034417 0.03090909
3 0.08517833 Seg2 0.10082414 0.09335045 0.04000000
4 0.10647280 Seg2 0.07367943 0.02017770 0.03656250
5 0.04973771 Seg2 0.05238778 0.02002063 0.03150000
6 0.10034801 Seg2 0.10569465 0.07024781 0.04342105
Mob_Inf_sec_cons X..Smart_purch X.Smt_Therm X.Smt_Hm_Secu X.Smt_Hm_Ass
1 0.02705882 0.02117647 0.02823529 0.009411765 0.03529412
2 0.03272727 0.02181818 0.04545454 0.009090909 0.04181818
3 0.03750000 0.04250000 0.05500000 0.012500000 0.05000000
4 0.03937500 0.05062500 0.04781250 0.019687500 0.06187500
5 0.04050000 0.03150000 0.05100000 0.031500000 0.04350000
6 0.04342105 0.05500000 0.06368421 0.043421053 0.06368421
Smt_Hm_Dev X.Smat_Therm.Pln X.Smt_Hm_Sec.Pln X.Smt_Hm_Asst.Pln
1 0.02000000 0.01647059 0.01411765 0.01529412
2 0.02909091 0.02909091 0.03272727 0.02181818
3 0.02000000 0.01500000 0.02500000 0.01000000
4 0.03937500 0.04218750 0.02531250 0.03656250
5 0.02850000 0.02550000 0.02400000 0.02250000
6 0.03473684 0.04052632 0.05210526 0.04921053
X.Smt_Hm_Dev.Pln Fin_App_Trst CODE Pop_Age M_Pop_Age F_Pop_Age X._Pvt_HHld
1 0.01294118 0.06588235 N2L2J2 85.0 86.1 83.5 5.0
2 0.02181818 0.09272727 L9M1Z4 37.5 35.0 67.5 2.0
3 0.02250000 0.13250000 N7G3J7 85.0 37.5 87.4 3.0
4 0.03093750 0.11531250 M9A5E4 86.6 33.2 88.6 13.0
5 0.02250000 0.07950000 M1N1M9 79.5 68.6 81.9 5.3
6 0.04342105 0.15052632 M1N4E6 68.9 65.0 70.0 3.0
X._Fam_Hhld X._NF_Hhld Avg_Chld_PrCen_Fm_Hhld Houses Apt
1 1 0 1.0 0.02941176 0.00000000
2 1 0 1.0 0.09090909 0.00000000
3 1 0 1.0 0.12500000 0.00000000
4 1 0 0.0 0.00000000 0.03125000
5 2 1 2.0 0.03750000 0.00000000
6 14 6 1.4 0.07894737 0.05263158
Condos Hhld.Income High.School Clg_CEGEP_Non_Uni_Dip Uni_Dip_Blw_Bach
1 0.00000000 1528.1129 0.02941176 0.02941176 0.02941176
2 0.00000000 28276.7809 0.09090909 0.00000000 0.09090909
3 0.00000000 10098.6787 0.12500000 0.00000000 0.00000000
4 0.03125000 2608.7778 0.09375000 0.06250000 0.12500000
5 0.01250000 359.3750 0.05000000 0.03750000 0.06250000
6 0.03289474 318.8611 0.07894737 0.07894737 0.11184210
Empl UnEmp Tot_Mig
1 0.08823529 0.029411765 0.02941176
2 0.18181818 0.000000000 0.00000000
3 0.25000000 0.000000000 0.00000000
4 0.09375000 0.031250000 0.21875000
5 0.11250000 0.000000000 0.11250000
6 0.11184210 0.006578947 0.22368421
[75869 rows x 34 columns]
6.7 Level 1: Random Forest
The first predictive model uses random forest, a flexible and highly valued method that partitions the data using tree splits. Instead of building just one tree, the model below builds 200 trees.
fs4C.class_rf <- h2o.randomForest(
training_frame = fs.split[[1]],
validation_frame = fs.split[[2]], ## the H2O frame for validation (not required)
x=c(1,3:16), ## the training sample predictor columns, by column index
y=2, ## the target index (what we are predicting)
model_id = "Random_Forest", ## name the model in H2O
## not required, but helps use Flow
ntrees = 200,
stopping_rounds = 2,
score_each_iteration = T,
seed = 1000000)
6.7.1 Confusion Matrices for RF
Training sample
While this might be interesting, it is best not to place too much credence on the hit-ratio for the training sample since this sample is used to build the model.
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Error Rate
Seg1 29081 1 0 3 0.0001 = 4 / 29,085
Seg2 0 53576 0 3 0.0001 = 3 / 53,579
Seg3 0 0 43924 0 0.0000 = 0 / 43,924
Seg4 1 1 0 50267 0.0000 = 2 / 50,269
Totals 29082 53578 43924 50273 0.0001 = 9 / 176,857
Testing Sample
Successful prediction is clearly showcased for the segments, as seen in the confusion matrix below. Using the RF model, 4,006 out of 75,869 people were incorrectly assigned to their segment membership, resulting in an error rate of 5.28%, which is very good. This means that the hit ratio was roughly 94.72%, which is very credible.
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Error Rate
Seg1 11475 404 291 314 0.0808 = 1,009 / 12,484
Seg2 326 22043 103 723 0.0497 = 1,152 / 23,195
Seg3 225 54 18118 274 0.0296 = 553 / 18,671
Seg4 200 642 450 20227 0.0600 = 1,292 / 21,519
Totals 12226 23143 18962 21538 0.0528 = 4,006 / 75,869
6.7.2 Hit Ratios for RF
As briefly mentioned above, the hit ratio is roughly 94.72%, which is very good. This alone is not enough to deduce that the model is good, therefore below we have created a receptacle of metrics to explore further for all the models.
$train
Top-4 Hit Ratios:
k hit_ratio
1 1 0.935959
2 2 0.992796
3 3 0.997614
4 4 1.000000
$valid
Top-4 Hit Ratios:
k hit_ratio
1 1 0.947198
2 2 0.995703
3 3 0.999407
4 4 1.000000
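The k = 1 row of the validation hit-ratio table can be reproduced directly from the holdout confusion matrix shown earlier; a base-R sanity check using those printed counts:

```r
# Hit ratio = share of correctly classified rows = sum of the diagonal / total
cm <- matrix(c(11475,   404,   291,   314,
                 326, 22043,   103,   723,
                 225,    54, 18118,   274,
                 200,   642,   450, 20227),
             nrow = 4, byrow = TRUE,
             dimnames = list(paste0("Seg", 1:4), paste0("Seg", 1:4)))

hit_ratio  <- sum(diag(cm)) / sum(cm)
error_rate <- 1 - hit_ratio
round(hit_ratio, 4)    # 0.9472, matching the k = 1 validation hit ratio
round(error_rate, 4)   # 0.0528, matching the confusion-matrix error rate
```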
6.7.3 Creating an object to hold important diagnostic information
A receptacle for diagnostic statistics is created below so that this information can be compared efficiently at the end of the predictive modeling. We will be comparing the hit ratios, MSEs, RMSEs, loglosses, and mean-per-class-errors.
modH <- data.frame(Prediction_model=character(),
hit_ratio=numeric(),
MSE=numeric(),
RMSE=numeric(),
logloss=numeric(),
mean_per_class_error=numeric(),
stringsAsFactors=FALSE)
modH[1, 1] <- "Random_forest"
modH[1, 2] <- fs4C.class_rf@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1] #
modH[1, 3] <- fs4C.class_rf@model$validation_metrics@metrics$MSE #
modH[1, 4] <- fs4C.class_rf@model$validation_metrics@metrics$RMSE #
modH[1, 5] <- fs4C.class_rf@model$validation_metrics@metrics$logloss
modH[1, 6] <- fs4C.class_rf@model$validation_metrics@metrics$mean_per_class_error
6.7.4 Plotting variable importances
Here we plot the variable importances from the random forest model using the plotly package. The top five variables with the highest relative importance are:
- Dollar value of annual premiums for home insurance (by a long shot)
- Number of people who own smart thermostats
- Number of people having trust in financial/banking app
- Number of people who purchased smart phones
- Number of people who plan on purchasing smart home technology
rf_variable_importances <- as.data.frame(fs4C.class_rf@model$variable_importances)
# rf_variable_importances
#install.packages("plotly", dependencies=TRUE)
library(plotly)
plot_ly(rf_variable_importances,
# x = rf_variable_importances$percentage,
y=reorder(rf_variable_importances$variable,
rf_variable_importances$percentage),
x = rf_variable_importances$percentage,
color = rf_variable_importances$variable,
type = 'bar', orientation = 'h') %>%
layout( title = "Variable Importance for the random forest model",
xaxis = list(title = "Percentage Importance"),
ylim=c(0,1),
margin = list(l = 120))
6.8 Level 1: Logistic Regression
Here we implement a multinomial logistic regression model for the 4-segment solution.
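The multinomial family generalizes binary logistic regression to several segment classes. As a tiny self-contained illustration (using nnet::multinom, which ships with R, on made-up data with three toy classes; the project itself uses h2o.glm on four segments):

```r
library(nnet)                      # 'recommended' package bundled with R
set.seed(7238)
n  <- 300
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
# toy class = argmax of three linear scores, so a multinomial logit fits it well
df$seg <- factor(paste0("Seg", max.col(cbind(df$x1, df$x2, -df$x1 - df$x2))))

fit  <- multinom(seg ~ x1 + x2, data = df, trace = FALSE)
pred <- predict(fit, df)
mean(pred == df$seg)               # in-sample hit ratio of the toy model
```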
fs4C.class_glm <- h2o.glm(
family= "multinomial",
training_frame = fs.split[[1]], ## the H2O frame for training
validation_frame = fs.split[[2]], ## the H2O frame for validation (not required)
x=c(1,3:16), ## the predictor columns, by column index
y=2,
lambda=0
)
6.8.1 Confusion Matrices for GLM
Training sample
As with random forest, here we check the confusion matrix for all segments within the training sample.
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Error Rate
Seg1 29038 22 4 21 0.0016 = 47 / 29,085
Seg2 15 53532 1 31 0.0009 = 47 / 53,579
Seg3 22 2 43875 25 0.0011 = 49 / 43,924
Seg4 6 25 20 50218 0.0010 = 51 / 50,269
Totals 29081 53581 43900 50295 0.0011 = 194 / 176,857
Testing sample
Successful prediction is clearly showcased for the segments, as seen in the confusion matrix below. Using the GLM model, 115 out of 75,869 people were incorrectly assigned to their segment membership, resulting in an error rate of 0.15%, which is even better than random forest. This means that the hit ratio was roughly 99.85%, which is even more credible than random forest.
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Error Rate
Seg1 12450 16 1 17 0.0027 = 34 / 12,484
Seg2 8 23162 1 24 0.0014 = 33 / 23,195
Seg3 7 0 18654 10 0.0009 = 17 / 18,671
Seg4 5 15 11 21488 0.0014 = 31 / 21,519
Totals 12470 23193 18667 21539 0.0015 = 115 / 75,869
6.8.2 Plotting variable importance
Looking at the variable importances, the top five variables with the highest relative importance are:
- Dollar value of annual premiums for home insurance (by a long shot)
- Number of people who own smart thermostats
- Number of people who purchased smart phones
- Number of people having trust in financial/banking app
- Level of enthusiasm for technology
variable relative_importance scaled_importance percentage
1 Ann_Prem_Hm_Insu 46.408944 1.00000000 0.19169294
2 X.Smt_Therm 33.312419 0.71780169 0.13759752
3 X..Smart_purch 23.821684 0.51329943 0.09839588
4 Fin_App_Trst 23.526912 0.50694780 0.09717832
5 Tech.Enthu. 20.993106 0.45235043 0.08671239
6 X.Smt_Hm_Ass 17.327060 0.37335604 0.07156972
7 Confidence 15.051687 0.32432729 0.06217125
8 X.Smt_Hm_Sec.Pln 11.359219 0.24476356 0.04691945
9 Smt_Hm_Dev 11.028004 0.23762669 0.04555136
10 X.Smt_Hm_Secu 10.383195 0.22373263 0.04288797
11 X.Smt_Hm_Dev.Pln 7.100212 0.15299231 0.02932755
12 Mob_Inf_sec_cons 7.016669 0.15119217 0.02898247
13 X.Smat_Therm.Pln 5.232745 0.11275295 0.02161394
14 X.Smt_Hm_Asst.Pln 4.952652 0.10671763 0.02045702
15 Mob_Mark_cons 4.585922 0.09881548 0.01894223
#glm_variable_importances <- as.data.frame(fs4C.class_glm@model$variable_importances)
glm_variable_importances <- as.data.frame(h2o.varimp(fs4C.class_glm))
## rf_variable_importances
##install.packages("plotly", dependencies=TRUE)
pacman::p_load(plotly)
plot_ly(glm_variable_importances,
# x = rf_variable_importances$percentage,
y=reorder(glm_variable_importances$variable,
glm_variable_importances$percentage),
x = glm_variable_importances$percentage,
color = glm_variable_importances$variable,
type = 'bar', orientation = 'h') %>%
layout( title = "Variable Importance for the logistic regression model on 4 segments",
xaxis = list(title = "Percentage Importance"),
margin = list(l = 120))
modH[2, 1] <- "GLM_log_regr"
modH[2, 2] <- fs4C.class_glm@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1] #
modH[2, 3] <- fs4C.class_glm@model$validation_metrics@metrics$MSE #
modH[2, 4] <- fs4C.class_glm@model$validation_metrics@metrics$RMSE #
modH[2, 5] <- fs4C.class_glm@model$validation_metrics@metrics$ logloss
modH[2, 6] <- fs4C.class_glm@model$validation_metrics@metrics$mean_per_class_error
6.9 Level 1: Gradient Boosting Machine
The gradient boosting machine is an ensemble method that builds decision trees sequentially, with each new tree correcting the errors of the previous ones, and it can be very effective. We fit this model as well to check whether it provides better results.
#GBM Gradient Boosting Machine
fs4C.class_gbm<- h2o.gbm(
distribution="AUTO",
training_frame = fs.split[[1]], ## the H2O frame for training
validation_frame = fs.split[[2]], ## the H2O frame for validation (not required)
x=c(1,3:16), ## the predictor columns, by column index
y=2,
model_id = "fs.gbm",
stopping_rounds = 3,
histogram_type = "UniformAdaptive" ,
stopping_tolerance = 1e-2,
seed = 1234
)
6.9.1 Model Performance
H2OMultinomialMetrics: gbm
Test Set Metrics:
=====================
MSE: (Extract with `h2o.mse`) 0.05102248
RMSE: (Extract with `h2o.rmse`) 0.2258816
Logloss: (Extract with `h2o.logloss`) 0.1765651
Mean Per-Class Error: 0.05254961
R^2: (Extract with `h2o.r2`) 0.954615
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Error Rate
Seg1 11603 335 233 313 0.0706 = 881 / 12,484
Seg2 243 22154 66 732 0.0449 = 1,041 / 23,195
Seg3 347 99 17931 294 0.0396 = 740 / 18,671
Seg4 178 595 413 20333 0.0551 = 1,186 / 21,519
Totals 12371 23183 18643 21672 0.0507 = 3,848 / 75,869
Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
=======================================================================
Top-4 Hit Ratios:
k hit_ratio
1 1 0.949281
2 2 0.995611
3 3 0.999697
4 4 1.000000
6.9.2 Confusion Matrices
Training Sample
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Error Rate
Seg1 27376 630 430 649 0.0588 = 1,709 / 29,085
Seg2 453 51684 101 1341 0.0354 = 1,895 / 53,579
Seg3 587 167 42569 601 0.0308 = 1,355 / 43,924
Seg4 312 1114 863 47980 0.0455 = 2,289 / 50,269
Totals 28728 53595 43963 50571 0.0410 = 7,248 / 176,857
Testing Sample
The confusion matrix for the testing sample is shown below.
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Error Rate
Seg1 11603 335 233 313 0.0706 = 881 / 12,484
Seg2 243 22154 66 732 0.0449 = 1,041 / 23,195
Seg3 347 99 17931 294 0.0396 = 740 / 18,671
Seg4 178 595 413 20333 0.0551 = 1,186 / 21,519
Totals 12371 23183 18643 21672 0.0507 = 3,848 / 75,869
Using the GBM model, 3,848 out of 75,869 people in the testing sample were incorrectly assigned to their segment, an error rate of 5.07%. This is better than the random forest model but not as low as the GLM model; the corresponding hit ratio of roughly 94.93% is still strong. All of the models have great hit ratios (all are about 94% or above).
# str(fs6C.class_gbm)
# fs6C.class_glm@ model$ validation_metrics@ metrics$ cm$ table
h2o.confusionMatrix(fs4C.class_gbm, valid = TRUE)
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Error Rate
Seg1 11603 335 233 313 0.0706 = 881 / 12,484
Seg2 243 22154 66 732 0.0449 = 1,041 / 23,195
Seg3 347 99 17931 294 0.0396 = 740 / 18,671
Seg4 178 595 413 20333 0.0551 = 1,186 / 21,519
Totals 12371 23183 18643 21672 0.0507 = 3,848 / 75,869
6.9.3 Plotting variable importance
Variable Importances:
variable relative_importance scaled_importance percentage
1 X.Smt_Therm 107209.156250 1.000000 0.214047
2 Ann_Prem_Hm_Insu 102356.835938 0.954740 0.204359
3 Fin_App_Trst 70859.429688 0.660946 0.141474
4 X..Smart_purch 61462.792969 0.573298 0.122713
5 X.Smt_Hm_Sec.Pln 33581.839844 0.313237 0.067047
6 Tech.Enthu. 29895.560547 0.278853 0.059688
7 X.Smt_Hm_Ass 24125.326172 0.225030 0.048167
8 X.Smt_Hm_Secu 20156.285156 0.188009 0.040243
9 Smt_Hm_Dev 15659.624023 0.146066 0.031265
10 Confidence 14745.135742 0.137536 0.029439
11 Mob_Inf_sec_cons 8862.408203 0.082665 0.017694
12 X.Smt_Hm_Dev.Pln 8198.396484 0.076471 0.016368
13 X.Smat_Therm.Pln 1741.412231 0.016243 0.003477
14 X.Smt_Hm_Asst.Pln 1040.876465 0.009709 0.002078
15 Mob_Mark_cons 971.792908 0.009064 0.001940
Looking at the variable importances, the top five variables by relative importance are:
- Number of people who own smart thermostats
- Dollar value of annual premiums for home insurance
- Number of people who trust financial/banking applications
- Number of people who purchased smart phones
- Number of people planning to purchase smart home security technology
# glm_variable_importances <- as.data.frame(fs6C.class_glm@model$variable_importances)
gbm_variable_importances <- as.data.frame(h2o.varimp(fs4C.class_gbm))
# rf_variable_importances
#install.packages("plotly", dependencies=TRUE)
pacman::p_load(plotly)
plot_ly(gbm_variable_importances,
# x = rf_variable_importances$percentage,
y=reorder(gbm_variable_importances$variable,
gbm_variable_importances$percentage),
x = gbm_variable_importances$percentage,
color = gbm_variable_importances$variable,
type = 'bar', orientation = 'h') %>%
layout( title = "Variable Importance for GBM model of 4-segment solution",
xaxis = list(title = "Percentage Importance"),
margin = list(l = 120))
6.10 Level 1: Comparing All of the Models
The GLM logistic regression model performed the best across all metrics, and is thus the best predictor. It had the:
- Highest hit ratio
- Lowest MSE
- Lowest RMSE
- Lowest log loss
- Lowest mean-per-class-error
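One quick consistency check on these metrics: RMSE is by definition the square root of MSE, so the two columns should always agree. For example, using the GBM test-set values reported in section 6.9.1:

```r
# Sanity check: RMSE should equal sqrt(MSE) (GBM test-set metrics from 6.9.1)
mse  <- 0.05102248
rmse <- 0.2258816
abs(sqrt(mse) - rmse) < 1e-6  # TRUE
```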
modH[3, 1] <- "Gradient Boosting"
modH[3, 2] <- fs4C.class_gbm@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1] #
modH[3, 3] <- fs4C.class_gbm@model$validation_metrics@metrics$MSE #
modH[3, 4] <- fs4C.class_gbm@model$validation_metrics@metrics$RMSE #
modH[3, 5] <- fs4C.class_gbm@model$validation_metrics@metrics$ logloss
modH[3, 6] <- fs4C.class_gbm@model$validation_metrics@metrics$ mean_per_class_error modH %>% tab_df( show.rownames = TRUE, sort.column = -2, title = "Statistics of predictive models for the 4-segment solution")| Row | Prediction_model | hit_ratio | MSE | RMSE | logloss | mean_per_class_error |
|---|---|---|---|---|---|---|
| 2 | GLM_log_regr | 1.00 | 0.01 | 0.07 | 0.02 | 0.00 |
| 3 | Gradient Boosting | 0.95 | 0.05 | 0.23 | 0.18 | 0.05 |
| 1 | Random_forest | 0.95 | 0.06 | 0.24 | 0.21 | 0.06 |
This plot compares the hit ratios across all models. As mentioned above, the GLM (logistic regression) has the highest hit ratio of 0.9984842. This, along with the other metrics mentioned above, suggests that GLM is the superior model.
pacman::p_load(reshape2, ggplot2)  ## melt() comes from reshape2
modH_hit <- modH[, 1:2]
modH_hit_long <- melt(modH_hit)
# Using Prediction_model as id variables
ggplot(data=modH_hit_long, aes(x= reorder(Prediction_model, -value), y=value ) ) +
geom_bar( stat="identity" , fill = "lightblue", width = 0.3)+
geom_point(aes( color= Prediction_model ), size=6) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size=14)) +
labs( title= "Hit ratios for predictive models", y="Proportion correct",
x= "Prediction Model")
6.11 Level 2 Segmentation
Level 2 of the segmentation process focuses on the very practical problem of attempting to assign people to segments based on variables that may be more accessible than the attitudinal variables, i.e., demographic variables. This second level of segmentation will use only the demographic variables which include the following:
- Total Population Median Age
- Male Population Median Age
- Female Population Median Age
- Average Number Of Persons In Private Households
- Total Family Households
- Non-Family Households
- Average Children Per Census Family Household
- Houses
- Apartment, Building Low And High Rise
- Condos
- Median Household Income
- High School Certificate Or Equivalent
- College, CEGEP Or Other Non-University Certificate Or Diploma
- University Certificate Or Diploma Below Bachelor
- Employed
- Unemployed
- Total Immigrant
Labeled buckets for each of the demographic variables have been prepared in the chunk below for interpretation purposes.
fs4C.class$Pop_Age_T<- cut(fs4C.class$Pop_Age, c(-1,25,50,75,100), labels = c("0-25 Years", "26-50 Years", "51-75 Years", "More than 76 Years"))
fs4C.class$M_Pop_Age_T<- cut(fs4C.class$M_Pop_Age, c(-1,25,50,75,100), labels = c("0-25 Years", "26-50 Years", "51-75 Years", "More than 76 Years"))
fs4C.class$F_Pop_Age_T<- cut(fs4C.class$F_Pop_Age, c(-1,25,50,75,100), labels = c("0-25 Years", "26-50 Years", "51-75 Years", "More than 76 Years"))
fs4C.class$X._Pvt_HHld_T<- cut(fs4C.class$X._Pvt_HHld, c(-1,2,4,6,8,15), labels = c("0-2 Persons Private Households", "3-4 Persons Private Households", "5-6 Persons Private Households", "7-8 Persons Private Households", "8+ Persons Private Households"))
fs4C.class$X._Fam_Hhld_T<- cut(fs4C.class$X._Fam_Hhld, c(-1,5,10,15,20,25,30,35,40,10000), labels = c("0-5 Family Households", "6-10 Family Households", "11-15 Family Households", "16-20 Family Households", "21-25 Family Households", "26-30 Family Households", "31-35 Family Households", "36-40 Family Households", "40+ Family Households"))
fs4C.class$X._NF_Hhld_T<- cut(fs4C.class$X._NF_Hhld, c(-1,1,3,5,7,1500), labels = c("0-1 Non-Family Households", "2-3 Non-Family Households", "4-5 Non-Family Households", "6-7 Non-Family Households", "8+ Non-Family Households"))
fs4C.class$Avg_Chld_PrCen_Fm_Hhld_T<- cut(fs4C.class$Avg_Chld_PrCen_Fm_Hhld, c(-1,1,2,3,4,14), labels = c("0-1 Children Per Household", "2 Children Per Household", "3 Children Per Household", "4 Children Per Household", "5 or More Children Per Household"))
#Houses
fs4C.class$Houses_T<- cut(fs4C.class$Houses, c(-1,0.25,0.50,0.75,1.00),
labels = c("0-25% Living in Houses",
"26-50% Living in Houses",
"51-75% Living in Houses",
"More than 76% Living in Houses"))
#Apartments
fs4C.class$Apt_T<- cut(fs4C.class$Apt, c(-1,0.25,0.50,0.75,1.00),
labels = c("0-25% Living in Apartments",
"26-50% Living in Apartments",
"51-75% Living in Apartments",
"More than 76% Living in Apartments"))
#Condos
fs4C.class$Condos_T<- cut(fs4C.class$Condos, c(-1,0.25,0.50,0.75,1.00),
labels = c("0-25% Living in Condos",
"26-50% Living in Condos",
"51-75% Living in Condos",
"More than 76% Living in Condos"))
#Household Income
fs4C.class$Hhld.Income_T<- cut(fs4C.class$Hhld.Income, c(-1,20000,40000,60000,80000,100000,125000,150000,200000,300000,1600000),
labels = c("Household income $0 to $19,999",
"Household income $20,000 to $39,999",
"Household income $40,000 to $59,999",
"Household income $60,000 to $79,999",
"Household income $80,000 to $99,999",
"Household income $100,000 to $124,999",
"Household income $125,000 to $149,999",
"Household income $150,000 to $199,999",
"Household income $200,000 to $299,999",
"Household income $300,000 or over"))
#High School
fs4C.class$High.School_T<- cut(fs4C.class$High.School, c(-1,0.25,0.50,0.75,1.00),
labels = c("0-25% completed High School",
"26-50% completed High School",
"51-75% completed High School",
"More than 76% completed High School"))
#Non-University Diploma
fs4C.class$Clg_CEGEP_Non_Uni_Dip_T<- cut(fs4C.class$Clg_CEGEP_Non_Uni_Dip, c(-1,0.25,0.50,0.75,1.00),
labels = c("0-25% completed Non-University Diploma",
"26-50% completed Non-University Diploma",
"51-75% completed Non-University Diploma",
"More than 76% completed Non-University Diploma"))
#University Degree
fs4C.class$Uni_Dip_Blw_Bach_T<- cut(fs4C.class$Uni_Dip_Blw_Bach, c(-1,0.25,0.50,0.75,1.00),
labels = c("0-25% completed University Degree",
"26-50% completed University Degree",
"51-75% completed University Degree",
"More than 76% completed University Degree"))
#Employed
fs4C.class$Empl_T<- cut(fs4C.class$Empl, c(-1,0.25,0.50,0.75,1.00),
labels = c("0-25% Employed",
"26-50% Employed",
"51-75% Employed",
"More than 76% Employed"))
#Unemployed
fs4C.class$UnEmp_T<- cut(fs4C.class$UnEmp, c(-1,0.25,0.50,0.75,1.00),
labels = c("0-25% Unemployed",
"26-50% Unemployed",
"51-75% Unemployed",
"More than 76% Unemployed"))
#Total Immigrants
fs4C.class$Tot_Mig_T<- cut(fs4C.class$Tot_Mig, c(-1,0.25,0.50,0.75,1.00),
labels = c("0-25% Immigrants",
"26-50% Immigrants",
"51-75% Immigrants",
"More than 76% Immigrants"))
Once the demographic variables have been recoded, we select only the demographic variables from the dataset for further analysis.
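The `cut()` calls in the chunk above bin continuous variables into labeled factors. Intervals are right-closed by default, which is why each break vector starts at -1: a value of 0 then lands in the first bucket. A minimal illustration with hypothetical ages:

```r
# cut() uses right-closed intervals: (-1,25], (25,50], (50,75], (75,100].
# Starting the breaks at -1 ensures a value of 0 lands in the first bucket.
ages <- c(0, 25, 26, 50, 80)
cut(ages, c(-1, 25, 50, 75, 100),
    labels = c("0-25 Years", "26-50 Years", "51-75 Years", "More than 76 Years"))
# 0 and 25 -> "0-25 Years"; 26 and 50 -> "26-50 Years"; 80 -> "More than 76 Years"
```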
Confidence predict Tech.Enthu. Ann_Prem_Hm_Insu Mob_Mark_cons
1 0.009426962 Seg2 0.009739573 0.004403856 0.005647059
2 0.025949382 Seg2 0.014750739 0.005865521 0.012673267
3 0.025238024 Seg2 0.029873820 0.020744546 0.015555556
4 0.031883370 Seg2 0.024817610 0.018098652 0.011764706
5 0.030433902 Seg2 0.028479670 0.009048419 0.018765432
6 0.056125319 Seg2 0.061331941 0.063795120 0.030182927
Mob_Inf_sec_cons X..Smart_purch X.Smt_Therm X.Smt_Hm_Secu X.Smt_Hm_Ass
1 0.007529412 0.007152941 0.009788235 0.004517647 0.01204706
2 0.021386139 0.014257426 0.022178218 0.007128713 0.02534653
3 0.025555556 0.018888889 0.028888889 0.012222222 0.03222222
4 0.027058824 0.021176471 0.028235294 0.009411765 0.03529412
5 0.027654321 0.024691358 0.023703704 0.011851852 0.02666667
6 0.028170732 0.024146341 0.042256098 0.012073171 0.03823171
Smt_Hm_Dev X.Smat_Therm.Pln X.Smt_Hm_Sec.Pln X.Smt_Hm_Asst.Pln
1 0.007152941 0.004517647 0.007905882 0.002635294
2 0.011881188 0.011089109 0.012673267 0.005544554
3 0.018888889 0.011111111 0.018888889 0.011111111
4 0.020000000 0.016470588 0.014117647 0.015294118
5 0.018765432 0.017777778 0.021728395 0.012839506
6 0.024146341 0.032195122 0.028170732 0.034207317
X.Smt_Hm_Dev.Pln Fin_App_Trst CODE Pop_Age_T M_Pop_Age_T
1 0.004141176 0.01882353 K7L5R2 26-50 Years 26-50 Years
2 0.011089109 0.04514852 K9V6H4 26-50 Years 26-50 Years
3 0.008888889 0.05666667 N2L2G4 More than 76 Years More than 76 Years
4 0.012941176 0.06588235 N2L2J2 More than 76 Years More than 76 Years
5 0.015802469 0.05827160 M6B2P9 More than 76 Years More than 76 Years
6 0.030182927 0.09054878 N7L5H5 More than 76 Years 51-75 Years
F_Pop_Age_T X._Pvt_HHld_T X._Fam_Hhld_T
1 0-25 Years 5-6 Persons Private Households 0-5 Family Households
2 26-50 Years 8+ Persons Private Households 0-5 Family Households
3 More than 76 Years 7-8 Persons Private Households 0-5 Family Households
4 More than 76 Years 5-6 Persons Private Households 0-5 Family Households
5 More than 76 Years 8+ Persons Private Households 0-5 Family Households
6 More than 76 Years 3-4 Persons Private Households 0-5 Family Households
X._NF_Hhld_T Avg_Chld_PrCen_Fm_Hhld_T Houses_T
1 0-1 Non-Family Households 3 Children Per Household 0-25% Living in Houses
2 0-1 Non-Family Households 3 Children Per Household 0-25% Living in Houses
3 0-1 Non-Family Households 2 Children Per Household 0-25% Living in Houses
4 0-1 Non-Family Households 0-1 Children Per Household 0-25% Living in Houses
5 0-1 Non-Family Households 3 Children Per Household 0-25% Living in Houses
6 8+ Non-Family Households 3 Children Per Household 0-25% Living in Houses
Apt_T Condos_T
1 0-25% Living in Apartments 0-25% Living in Condos
2 0-25% Living in Apartments 0-25% Living in Condos
3 0-25% Living in Apartments 0-25% Living in Condos
4 0-25% Living in Apartments 0-25% Living in Condos
5 0-25% Living in Apartments 0-25% Living in Condos
6 0-25% Living in Apartments 0-25% Living in Condos
Hhld.Income_T High.School_T
1 Household income $0 to $19,999 0-25% completed High School
2 Household income $0 to $19,999 0-25% completed High School
3 Household income $0 to $19,999 0-25% completed High School
4 Household income $0 to $19,999 0-25% completed High School
5 Household income $0 to $19,999 0-25% completed High School
6 Household income $0 to $19,999 0-25% completed High School
Clg_CEGEP_Non_Uni_Dip_T Uni_Dip_Blw_Bach_T
1 0-25% completed Non-University Diploma 0-25% completed University Degree
2 0-25% completed Non-University Diploma 0-25% completed University Degree
3 0-25% completed Non-University Diploma 0-25% completed University Degree
4 0-25% completed Non-University Diploma 0-25% completed University Degree
5 0-25% completed Non-University Diploma 0-25% completed University Degree
6 0-25% completed Non-University Diploma 0-25% completed University Degree
Empl_T UnEmp_T Tot_Mig_T
1 0-25% Employed 0-25% Unemployed 0-25% Immigrants
2 0-25% Employed 0-25% Unemployed 0-25% Immigrants
3 0-25% Employed 0-25% Unemployed 0-25% Immigrants
4 0-25% Employed 0-25% Unemployed 0-25% Immigrants
5 0-25% Employed 0-25% Unemployed 0-25% Immigrants
6 0-25% Employed 0-25% Unemployed 26-50% Immigrants
Here we rename the recoded (bucketed) variables back to concise names for the remainder of the analysis.
library(sjmisc)
names(fs4C.class)[names(fs4C.class) == "Pop_Age_T"] <- "Pop_Age"
names(fs4C.class)[names(fs4C.class) == "M_Pop_Age_T"] <- "M_Pop_Age"
names(fs4C.class)[names(fs4C.class) == "F_Pop_Age_T"] <- "F_Pop_Age"
names(fs4C.class)[names(fs4C.class) == "X._Pvt_HHld_T"] <- "Pvt_HHld"
names(fs4C.class)[names(fs4C.class) == "X._Fam_Hhld_T"] <- "Fam_Hhld"
names(fs4C.class)[names(fs4C.class) == "X._NF_Hhld_T"] <- "NF_Hhld"
names(fs4C.class)[names(fs4C.class) == "Avg_Chld_PrCen_Fm_Hhld_T"] <- "Avg_Chld_PrCen_Fm_Hhld"
names(fs4C.class)[names(fs4C.class) == "Houses_T"] <- "Houses"
names(fs4C.class)[names(fs4C.class) == "Apt_T"] <- "Apt"
names(fs4C.class)[names(fs4C.class) == "Condos_T"] <- "Condos"
names(fs4C.class)[names(fs4C.class) == "Hhld.Income_T"] <- "Hhld.Income"
names(fs4C.class)[names(fs4C.class) == "High.School_T"] <- "High.School"
names(fs4C.class)[names(fs4C.class) == "Clg_CEGEP_Non_Uni_Dip_T"] <- "Clg_CEGEP_Non_Uni_Dip"
names(fs4C.class)[names(fs4C.class) == "Uni_Dip_Blw_Bach_T"] <- "Uni_Dip_Blw_Bach"
names(fs4C.class)[names(fs4C.class) == "Empl_T"] <- "Empl"
names(fs4C.class)[names(fs4C.class) == "UnEmp_T"] <- "UnEmp"
names(fs4C.class)[names(fs4C.class) == "Tot_Mig_T"] <- "Tot_Mig"
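The repeated `names()` assignments above could be collapsed into one step with a named lookup vector; a minimal sketch on a toy data frame (illustrative subset only -- the full map would list all 17 pairs from the chunk above):

```r
# One-step alternative to repeated names()<- assignments: a named lookup
# vector mapping old column names to new ones
rename_map <- c(Pop_Age_T = "Pop_Age", Houses_T = "Houses", Apt_T = "Apt")
df  <- data.frame(Pop_Age_T = 1, Houses_T = 2, Other = 3)  # toy frame
hit <- names(df) %in% names(rename_map)
names(df)[hit] <- rename_map[names(df)[hit]]
names(df)  # "Pop_Age" "Houses" "Other"
```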
colnames(fs4C.class)
 [1] "Confidence" "predict" "Tech.Enthu."
[4] "Ann_Prem_Hm_Insu" "Mob_Mark_cons" "Mob_Inf_sec_cons"
[7] "X..Smart_purch" "X.Smt_Therm" "X.Smt_Hm_Secu"
[10] "X.Smt_Hm_Ass" "Smt_Hm_Dev" "X.Smat_Therm.Pln"
[13] "X.Smt_Hm_Sec.Pln" "X.Smt_Hm_Asst.Pln" "X.Smt_Hm_Dev.Pln"
[16] "Fin_App_Trst" "CODE" "Pop_Age"
[19] "M_Pop_Age" "F_Pop_Age" "Pvt_HHld"
[22] "Fam_Hhld" "NF_Hhld" "Avg_Chld_PrCen_Fm_Hhld"
[25] "Houses" "Apt" "Condos"
[28] "Hhld.Income" "High.School" "Clg_CEGEP_Non_Uni_Dip"
[31] "Uni_Dip_Blw_Bach" "Empl" "UnEmp"
[34] "Tot_Mig"
6.11.1 Variable Exploration
The following bar charts simply show the number of people or number of households in each category.
6.11.2 Splitting the sample for cross-validation
We will start with splitting our dataset into training and testing set.
#__splitting df into training (70%) and testing (validation) datasets________
fs.split<- h2o.splitFrame(fs4C.class, ratios=c(0.7))
The training sample on which the predictive model will be built is below.
Confidence predict Tech.Enthu. Ann_Prem_Hm_Insu Mob_Mark_cons
1 0.009426962 Seg2 0.009739573 0.004403856 0.005647059
2 0.025949382 Seg2 0.014750739 0.005865521 0.012673267
3 0.025238024 Seg2 0.029873820 0.020744546 0.015555556
4 0.031883370 Seg2 0.024817610 0.018098652 0.011764706
5 0.056125319 Seg2 0.061331941 0.063795120 0.030182927
6 0.030407734 Seg2 0.031585243 0.020467166 0.011842105
Mob_Inf_sec_cons X..Smart_purch X.Smt_Therm X.Smt_Hm_Secu X.Smt_Hm_Ass
1 0.007529412 0.007152941 0.009788235 0.004517647 0.01204706
2 0.021386139 0.014257426 0.022178218 0.007128713 0.02534653
3 0.025555556 0.018888889 0.028888889 0.012222222 0.03222222
4 0.027058824 0.021176471 0.028235294 0.009411765 0.03529412
5 0.028170732 0.024146341 0.042256098 0.012073171 0.03823171
6 0.030263158 0.015789474 0.034210526 0.019736842 0.04473684
Smt_Hm_Dev X.Smat_Therm.Pln X.Smt_Hm_Sec.Pln X.Smt_Hm_Asst.Pln
1 0.007152941 0.004517647 0.007905882 0.002635294
2 0.011881188 0.011089109 0.012673267 0.005544554
3 0.018888889 0.011111111 0.018888889 0.011111111
4 0.020000000 0.016470588 0.014117647 0.015294118
5 0.024146341 0.032195122 0.028170732 0.034207317
6 0.025000000 0.010526316 0.019736842 0.009210526
X.Smt_Hm_Dev.Pln Fin_App_Trst CODE Pop_Age M_Pop_Age
1 0.004141176 0.01882353 K7L5R2 26-50 Years 26-50 Years
2 0.011089109 0.04514852 K9V6H4 26-50 Years 26-50 Years
3 0.008888889 0.05666667 N2L2G4 More than 76 Years More than 76 Years
4 0.012941176 0.06588235 N2L2J2 More than 76 Years More than 76 Years
5 0.030182927 0.09054878 N7L5H5 More than 76 Years 51-75 Years
6 0.018421053 0.07236842 N3L4E3 51-75 Years 51-75 Years
F_Pop_Age Pvt_HHld Fam_Hhld
1 0-25 Years 5-6 Persons Private Households 0-5 Family Households
2 26-50 Years 8+ Persons Private Households 0-5 Family Households
3 More than 76 Years 7-8 Persons Private Households 0-5 Family Households
4 More than 76 Years 5-6 Persons Private Households 0-5 Family Households
5 More than 76 Years 3-4 Persons Private Households 0-5 Family Households
6 51-75 Years 7-8 Persons Private Households 0-5 Family Households
NF_Hhld Avg_Chld_PrCen_Fm_Hhld Houses
1 0-1 Non-Family Households 3 Children Per Household 0-25% Living in Houses
2 0-1 Non-Family Households 3 Children Per Household 0-25% Living in Houses
3 0-1 Non-Family Households 2 Children Per Household 0-25% Living in Houses
4 0-1 Non-Family Households 0-1 Children Per Household 0-25% Living in Houses
5 8+ Non-Family Households 3 Children Per Household 0-25% Living in Houses
6 0-1 Non-Family Households 0-1 Children Per Household 0-25% Living in Houses
Apt Condos
1 0-25% Living in Apartments 0-25% Living in Condos
2 0-25% Living in Apartments 0-25% Living in Condos
3 0-25% Living in Apartments 0-25% Living in Condos
4 0-25% Living in Apartments 0-25% Living in Condos
5 0-25% Living in Apartments 0-25% Living in Condos
6 0-25% Living in Apartments 0-25% Living in Condos
Hhld.Income High.School
1 Household income $0 to $19,999 0-25% completed High School
2 Household income $0 to $19,999 0-25% completed High School
3 Household income $0 to $19,999 0-25% completed High School
4 Household income $0 to $19,999 0-25% completed High School
5 Household income $0 to $19,999 0-25% completed High School
6 Household income $0 to $19,999 0-25% completed High School
Clg_CEGEP_Non_Uni_Dip Uni_Dip_Blw_Bach
1 0-25% completed Non-University Diploma 0-25% completed University Degree
2 0-25% completed Non-University Diploma 0-25% completed University Degree
3 0-25% completed Non-University Diploma 0-25% completed University Degree
4 0-25% completed Non-University Diploma 0-25% completed University Degree
5 0-25% completed Non-University Diploma 0-25% completed University Degree
6 0-25% completed Non-University Diploma 0-25% completed University Degree
Empl UnEmp Tot_Mig
1 0-25% Employed 0-25% Unemployed 0-25% Immigrants
2 0-25% Employed 0-25% Unemployed 0-25% Immigrants
3 0-25% Employed 0-25% Unemployed 0-25% Immigrants
4 0-25% Employed 0-25% Unemployed 0-25% Immigrants
5 0-25% Employed 0-25% Unemployed 26-50% Immigrants
6 0-25% Employed 0-25% Unemployed 0-25% Immigrants
[176995 rows x 34 columns]
The holdout, or testing, sample will be used to help understand the ability of the model to predict segment membership.
Confidence predict Tech.Enthu. Ann_Prem_Hm_Insu Mob_Mark_cons
1 0.03043390 Seg2 0.02847967 0.009048419 0.01876543
2 0.07610093 Seg2 0.03983030 0.092729259 0.02285714
3 0.17222783 Seg2 0.09188291 0.191906973 0.04990826
4 0.08864146 Seg2 0.09336437 0.015857921 0.03712871
5 0.07195438 Seg2 0.07085803 0.070776382 0.04000000
6 0.06057126 Seg2 0.07169717 0.074680364 0.03733333
Mob_Inf_sec_cons X..Smart_purch X.Smt_Therm X.Smt_Hm_Secu X.Smt_Hm_Ass
1 0.02765432 0.02469136 0.02370370 0.011851852 0.02666667
2 0.02857143 0.01285714 0.01857143 0.005714286 0.02000000
3 0.04366972 0.05614679 0.03119266 0.006238532 0.04990826
4 0.04455445 0.06930693 0.06188119 0.032178218 0.06188119
5 0.04500000 0.04750000 0.05000000 0.020000000 0.06000000
6 0.04666667 0.03033333 0.05366667 0.018666667 0.04200000
Smt_Hm_Dev X.Smat_Therm.Pln X.Smt_Hm_Sec.Pln X.Smt_Hm_Asst.Pln
1 0.01876543 0.01777778 0.02172840 0.01283951
2 0.01285714 0.01142857 0.01142857 0.01285714
3 0.03119266 0.01871560 0.03119266 0.04990826
4 0.04702970 0.03465347 0.05693069 0.02722772
5 0.04500000 0.03750000 0.04500000 0.02500000
6 0.02566667 0.03733333 0.03966667 0.02566667
X.Smt_Hm_Dev.Pln Fin_App_Trst CODE Pop_Age M_Pop_Age
1 0.01580247 0.05827160 M6B2P9 More than 76 Years More than 76 Years
2 0.01000000 0.07571429 K6H6X7 More than 76 Years More than 76 Years
3 0.03743119 0.24330275 L4A0P5 51-75 Years 51-75 Years
4 0.03217822 0.13613861 M6B4E5 More than 76 Years 51-75 Years
5 0.04000000 0.13500000 L9M2B9 26-50 Years 26-50 Years
6 0.02333333 0.10266667 L9C2C7 More than 76 Years 51-75 Years
F_Pop_Age Pvt_HHld Fam_Hhld
1 More than 76 Years 8+ Persons Private Households 0-5 Family Households
2 More than 76 Years 0-2 Persons Private Households 0-5 Family Households
3 51-75 Years 0-2 Persons Private Households 21-25 Family Households
4 More than 76 Years 8+ Persons Private Households 0-5 Family Households
5 26-50 Years 3-4 Persons Private Households 0-5 Family Households
6 More than 76 Years 3-4 Persons Private Households 0-5 Family Households
NF_Hhld Avg_Chld_PrCen_Fm_Hhld Houses
1 0-1 Non-Family Households 3 Children Per Household 0-25% Living in Houses
2 0-1 Non-Family Households 0-1 Children Per Household 0-25% Living in Houses
3 8+ Non-Family Households 0-1 Children Per Household 0-25% Living in Houses
4 0-1 Non-Family Households 0-1 Children Per Household 0-25% Living in Houses
5 0-1 Non-Family Households 0-1 Children Per Household 0-25% Living in Houses
6 0-1 Non-Family Households 2 Children Per Household 0-25% Living in Houses
Apt Condos
1 0-25% Living in Apartments 0-25% Living in Condos
2 0-25% Living in Apartments 0-25% Living in Condos
3 26-50% Living in Apartments 0-25% Living in Condos
4 0-25% Living in Apartments 0-25% Living in Condos
5 0-25% Living in Apartments 0-25% Living in Condos
6 0-25% Living in Apartments 0-25% Living in Condos
Hhld.Income High.School
1 Household income $0 to $19,999 0-25% completed High School
2 Household income $0 to $19,999 0-25% completed High School
3 Household income $0 to $19,999 0-25% completed High School
4 Household income $0 to $19,999 0-25% completed High School
5 Household income $0 to $19,999 0-25% completed High School
6 Household income $0 to $19,999 0-25% completed High School
Clg_CEGEP_Non_Uni_Dip Uni_Dip_Blw_Bach
1 0-25% completed Non-University Diploma 0-25% completed University Degree
2 0-25% completed Non-University Diploma 0-25% completed University Degree
3 0-25% completed Non-University Diploma 0-25% completed University Degree
4 0-25% completed Non-University Diploma 0-25% completed University Degree
5 0-25% completed Non-University Diploma 0-25% completed University Degree
6 0-25% completed Non-University Diploma 0-25% completed University Degree
Empl UnEmp Tot_Mig
1 0-25% Employed 0-25% Unemployed 0-25% Immigrants
2 0-25% Employed 0-25% Unemployed 0-25% Immigrants
3 0-25% Employed 0-25% Unemployed 0-25% Immigrants
4 0-25% Employed 0-25% Unemployed 0-25% Immigrants
5 0-25% Employed 0-25% Unemployed 0-25% Immigrants
6 0-25% Employed 0-25% Unemployed 0-25% Immigrants
[75731 rows x 34 columns]
6.12 Level 2: Random forest model
Here we build a random forest model that attempts to predict membership in the 4-segment solution developed in Level 1 using only the demographic variables.
fs3Cov.class_rf <- h2o.randomForest( ## h2o.randomForest function
training_frame = fs.split[[1]], ## the H2O frame for training
validation_frame = fs.split[[2]], ## the H2O frame for validation (not required)
x=18:34, ## the predictor columns, by column index
y=2, ## the target index (what we are predicting)
model_id = "RF_cov",
ntrees = 200,
stopping_rounds = 2,
score_each_iteration = T,
seed = 1000000) ## Set the random seed so that this can be reproduced.
6.12.1 Confusion Matrices
Training Sample
The predictive ability of the random forest model on the training sample is quite poor. However, training-sample performance should not be relied upon; the testing sample gives a fairer assessment.
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Error Rate
Seg1 15287 5765 4458 3507 0.4732 = 13,730 / 29,017
Seg2 2542 36994 3719 10581 0.3128 = 16,842 / 53,836
Seg3 1524 2585 36299 3460 0.1725 = 7,569 / 43,868
Seg4 1653 11066 4100 33455 0.3345 = 16,819 / 50,274
Totals 21006 56410 48576 51003 0.3105 = 54,960 / 176,995
Testing Sample
The error rate and the corresponding hit ratio for the testing sample are very poor compared to the Level 1 random forest model: 27,559 out of 75,731 people were incorrectly assigned their segment membership, an error rate of 36.39% and a hit ratio of 63.61%.
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Error Rate
Seg1 5636 2861 2244 1811 0.5510 = 6,916 / 12,552
Seg2 1453 14567 1832 5086 0.3649 = 8,371 / 22,938
Seg3 859 1261 14797 1810 0.2099 = 3,930 / 18,727
Seg4 922 5349 2071 13172 0.3877 = 8,342 / 21,514
Totals 8870 24038 20944 21879 0.3639 = 27,559 / 75,731
The top-1 hit ratio is roughly 63.7% for the training sample and 63.6% for the testing (validation) sample, as shown below.
$train
Top-4 Hit Ratios:
k hit_ratio
1 1 0.636606
2 2 0.878381
3 3 0.968664
4 4 1.000000
$valid
Top-4 Hit Ratios:
k hit_ratio
1 1 0.636094
2 2 0.878768
3 3 0.970171
4 4 1.000000
6.12.2 Variable Importances
While total immigrants, houses, university diploma below bachelor, average children per census family household, and persons per private household are the most important variables, they do not do a good job of predicting segment membership.
Variable Importances:
variable relative_importance scaled_importance percentage
1 Tot_Mig 212571.953125 1.000000 0.242930
2 Houses 93486.242188 0.439786 0.106837
3 Uni_Dip_Blw_Bach 82943.546875 0.390190 0.094789
4 Avg_Chld_PrCen_Fm_Hhld 71780.531250 0.337676 0.082032
5 Pvt_HHld 66663.859375 0.313606 0.076184
6 NF_Hhld 62827.562500 0.295559 0.071800
7 Fam_Hhld 54388.355469 0.255859 0.062156
8 Empl 46267.671875 0.217657 0.052875
9 Apt 44258.714844 0.208206 0.050579
10 Pop_Age 30073.251953 0.141473 0.034368
11 F_Pop_Age 25027.115234 0.117735 0.028601
12 M_Pop_Age 23832.818359 0.112116 0.027236
13 Hhld.Income 20380.693359 0.095877 0.023291
14 High.School 15530.393555 0.073059 0.017748
15 Clg_CEGEP_Non_Uni_Dip 12773.874023 0.060092 0.014598
16 Condos 10440.803711 0.049117 0.011932
17 UnEmp 1786.749756 0.008405 0.002042
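The three importance columns are largely redundant: scaled_importance and percentage are both derived from relative_importance. A small Python check against the top rows of the table above (the grand sum over all 17 variables is backed out from Tot_Mig's reported percentage):

```python
# H2O's variable-importance table carries one measured quantity
# (relative_importance); the other two columns are derived from it:
#   scaled_importance = relative / max(relative)
#   percentage        = relative / sum(relative)
rel = {
    "Tot_Mig": 212571.953125,
    "Houses": 93486.242188,
    "Uni_Dip_Blw_Bach": 82943.546875,
}
top = max(rel.values())
grand_sum = rel["Tot_Mig"] / 0.242930  # relative / percentage for any row

for name, r in rel.items():
    print(f"{name:18s} scaled={r / top:.6f} pct={r / grand_sum:.6f}")
```

Running this reproduces the scaled_importance and percentage columns shown above to within rounding.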
6.12.3 The diagnostic statistics
The table below shows the hit-ratio, MSE (mean square error), RMSE (root mean square error), logloss and mean-per-class-error for the random forest model. This will be compared to the other models further along in this document.
# THE FOLLOWING covH data frame IS TO HOLD STATISTICS FROM EACH MODEL
covH <- data.frame(Prediction_model=character(),
hit_ratio=numeric(),
MSE=numeric(),
RMSE=numeric(),
logloss=numeric(),
mean_per_class_error=numeric(),
stringsAsFactors=FALSE)
covH[1, 1] <- "Random_forest"
covH[1, 2] <- fs3Cov.class_rf@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1] #
covH[1, 3] <- fs3Cov.class_rf@model$validation_metrics@metrics$MSE #
covH[1, 4] <- fs3Cov.class_rf@model$validation_metrics@metrics$RMSE #
covH[1, 5] <- fs3Cov.class_rf@model$validation_metrics@metrics$logloss
covH[1, 6] <- fs3Cov.class_rf@model$validation_metrics@metrics$mean_per_class_error
covH
  Prediction_model hit_ratio MSE RMSE logloss mean_per_class_error
1 Random_forest 0.6360936 0.2957424 0.543822 0.8446302 0.3783833
6.13 Level 2: Logistic regression
Logistic regression using GLM is the second predictive model that will be investigated.
fs3Cov.class_glm<- h2o.glm(
family= "multinomial",
training_frame = fs.split[[1]], ## the H2O frame for training
validation_frame = fs.split[[2]], ## the H2O frame for validation (not required)
x=18:34, ## the predictor columns, by column index
y=2,
lambda=0 ## no regularization: a plain multinomial logit
)
6.13.1 Confusion Matrices
Training Sample
The error rate is quite poor on the training sample for the GLM model, roughly 37.3%.
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Error Rate
Seg1 12908 7228 4835 4046 0.5552 = 16,109 / 29,017
Seg2 3820 33332 4145 12539 0.3809 = 20,504 / 53,836
Seg3 2244 3721 33868 4035 0.2280 = 10,000 / 43,868
Seg4 2403 12503 4536 30832 0.3867 = 19,442 / 50,274
Totals 21375 56784 47384 51452 0.3732 = 66,055 / 176,995
Testing Sample
The holdout sample error rate is slightly worse than that of the training sample: 37.52%, versus 37.32% for the training sample.
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Error Rate
Seg1 5543 3146 2111 1752 0.5584 = 7,009 / 12,552
Seg2 1675 14121 1784 5358 0.3844 = 8,817 / 22,938
Seg3 908 1525 14483 1811 0.2266 = 4,244 / 18,727
Seg4 966 5407 1972 13169 0.3879 = 8,345 / 21,514
Totals 9092 24199 20350 22090 0.3752 = 28,415 / 75,731
Of course the hit ratios, which are just 1 minus the error rates, are quite poor: 62.68% for the training sample and 62.48% for the testing sample.
$train
Top-4 Hit Ratios:
k hit_ratio
1 1 0.626797
2 2 0.872663
3 3 0.967745
4 4 1.000000
$valid
Top-4 Hit Ratios:
k hit_ratio
1 1 0.624790
2 2 0.870581
3 3 0.967266
4 4 1.000000
6.13.2 Diagnostic statistics for the GLM model
The table below shows the diagnostic statistics for the random forest and logistic regression models. So far, the Random Forest model looks better overall, but not good enough to present to a client.
covH[2, 1] <- "GLM"
covH[2, 2] <- fs3Cov.class_glm@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1] #
covH[2, 3] <- fs3Cov.class_glm@model$validation_metrics@metrics$MSE #
covH[2, 4] <- fs3Cov.class_glm@model$validation_metrics@metrics$RMSE #
covH[2, 5] <- fs3Cov.class_glm@model$validation_metrics@metrics$logloss
covH[2, 6] <- fs3Cov.class_glm@model$validation_metrics@metrics$mean_per_class_error
covH
  Prediction_model hit_ratio MSE RMSE logloss mean_per_class_error
1 Random_forest 0.6360936 0.2957424 0.543822 0.8446302 0.3783833
2 GLM 0.6247904 0.3068551 0.553945 0.8733787 0.3893232
6.14 Level 2: Deep Learning Model
The basic deep learning model has a very simple structure that can be expanded.
fs3Cov.class_dl<- h2o.deeplearning(
training_frame = fs.split[[1]], ## the H2O frame for training
validation_frame = fs.split[[2]], ## the H2O frame for validation (not required)
x=18:34, ## the predictor columns, by column index
y=2
)
6.14.1 Confusion Matrices
Training Sample
The training sample error rate is 36.56% (computed on a 9,912-row temporary training frame); however, this does not mean much on its own, as we care more about the testing sample.
H2OMultinomialMetrics: deeplearning
** Reported on training data. **
** Metrics reported on temporary training frame with 9912 samples **
Training Set Metrics:
=====================
MSE: (Extract with `h2o.mse`) 0.2885144
RMSE: (Extract with `h2o.rmse`) 0.5371354
Logloss: (Extract with `h2o.logloss`) 0.8378529
Mean Per-Class Error: 0.3821669
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Error Rate
Seg1 650 239 385 327 0.5940 = 951 / 1,601
Seg2 200 1556 328 926 0.4831 = 1,454 / 3,010
Seg3 55 61 2091 279 0.1589 = 395 / 2,486
Seg4 74 464 286 1991 0.2927 = 824 / 2,815
Totals 979 2320 3090 3523 0.3656 = 3,624 / 9,912
Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
=======================================================================
Top-4 Hit Ratios:
k hit_ratio
1 1 0.634383
2 2 0.877018
3 3 0.970742
4 4 1.000000
Testing Sample
Unfortunately, the validation sample error rate is in the same ballpark as the other models for level 2 segmentation, at roughly 37.63%.
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Error Rate
Seg1 4929 1965 3173 2485 0.6073 = 7,623 / 12,552
Seg2 1309 11545 2837 7247 0.4967 = 11,393 / 22,938
Seg3 402 477 15763 2085 0.1583 = 2,964 / 18,727
Seg4 612 3581 2322 14999 0.3028 = 6,515 / 21,514
Totals 7252 17568 24095 26816 0.3763 = 28,495 / 75,731
Again, the hit ratios are very poor. The testing sample hit ratio is 62.37%, against 63.44% for the training sample.
$train
Top-4 Hit Ratios:
k hit_ratio
1 1 0.634383
2 2 0.877018
3 3 0.970742
4 4 1.000000
$valid
Top-4 Hit Ratios:
k hit_ratio
1 1 0.623734
2 2 0.873011
3 3 0.969035
4 4 1.000000
6.14.2 Diagnostic statistics
The diagnostics statistics table indicates that the Random Forest model is still the best performer in a very dismal race.
covH[3, 1] <- "Deep_Learning"
covH[3, 2] <- fs3Cov.class_dl@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1] #
covH[3, 3] <- fs3Cov.class_dl@model$validation_metrics@metrics$MSE #
covH[3, 4] <- fs3Cov.class_dl@model$validation_metrics@metrics$RMSE #
covH[3, 5] <- fs3Cov.class_dl@model$validation_metrics@metrics$ logloss
covH[3, 6] <- fs3Cov.class_dl@model$validation_metrics@metrics$ mean_per_class_error
covH Prediction_model hit_ratio MSE RMSE logloss mean_per_class_error
1 Random_forest 0.6360936 0.2957424 0.5438220 0.8446302 0.3783833
2 GLM 0.6247904 0.3068551 0.5539450 0.8733787 0.3893232
3 Deep_Learning 0.6237340 0.2963624 0.5443918 0.8637924 0.3912751
6.15 Level 2: Gradient Boosting Model
The basic gradient boosting model, an ensemble of shallow decision trees fitted sequentially, is specified below.
fs3Cov.class_gbm<- h2o.gbm(
training_frame = fs.split[[1]], ## the H2O frame for training
validation_frame = fs.split[[2]], ## the H2O frame for validation (not required)
x=18:34, ## the predictor columns, by column index
y=2
)
6.15.1 Confusion Matrices
Training Sample
The high error rate shown below for the training sample is similar to that of the previous models, at 35.83%.
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Error Rate
Seg1 13353 6628 4933 4103 0.5398 = 15,664 / 29,017
Seg2 3384 34681 4168 11603 0.3558 = 19,155 / 53,836
Seg3 2017 3145 34653 4053 0.2101 = 9,215 / 43,868
Seg4 2237 12725 4421 30891 0.3855 = 19,383 / 50,274
Totals 20991 57179 48175 50650 0.3583 = 63,417 / 176,995
Testing Sample
However, the holdout sample error rate is 36.32%, which is again in the same ballpark as the previous models.
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Error Rate
Seg1 5639 2931 2189 1793 0.5507 = 6,913 / 12,552
Seg2 1455 14727 1782 4974 0.3580 = 8,211 / 22,938
Seg3 833 1306 14738 1850 0.2130 = 3,989 / 18,727
Seg4 906 5561 1927 13120 0.3902 = 8,394 / 21,514
Totals 8833 24525 20636 21737 0.3632 = 27,507 / 75,731
The hit ratio for the testing sample is 63.68%.
$train
Top-4 Hit Ratios:
k hit_ratio
1 1 0.641702
2 2 0.881629
3 3 0.970705
4 4 1.000000
$valid
Top-4 Hit Ratios:
k hit_ratio
1 1 0.636780
2 2 0.878768
3 3 0.970501
4 4 1.000000
6.15.2 Diagnostic statistics
The GBM model is similar to Random Forest in terms of performance across the five metrics. Again, nothing to write home about.
covH[4, 1] <- "GBM_Boosting"
covH[4, 2] <- fs3Cov.class_gbm@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1] #
covH[4, 3] <- fs3Cov.class_gbm@model$validation_metrics@metrics$MSE #
covH[4, 4] <- fs3Cov.class_gbm@model$validation_metrics@metrics$RMSE #
covH[4, 5] <- fs3Cov.class_gbm@model$validation_metrics@metrics$logloss
covH[4, 6] <- fs3Cov.class_gbm@model$validation_metrics@metrics$mean_per_class_error
covH
  Prediction_model hit_ratio MSE RMSE logloss mean_per_class_error
1 Random_forest 0.6360936 0.2957424 0.5438220 0.8446302 0.3783833
2 GLM 0.6247904 0.3068551 0.5539450 0.8733787 0.3893232
3 Deep_Learning 0.6237340 0.2963624 0.5443918 0.8637924 0.3912751
4 GBM_Boosting 0.6367802 0.3010183 0.5486513 0.8505309 0.3779716
6.16 Level 2: Naive Bayes Model
Finally, we fitted a Naive Bayes model.
fs3Cov.class_nB <- h2o.naiveBayes(
training_frame = fs.split[[1]], ## the H2O frame for training
validation_frame = fs.split[[2]], ## the H2O frame for validation (not required)
x=18:34, ## the predictor columns, by column index
y=2,
laplace = 3) ## add-3 smoothing so unseen predictor levels get nonzero probabilities
6.16.1 Confusion Matrices
Training Sample
The Naive Bayes model error rate for the training sample is the highest we have seen thus far for training samples, at 40.50%.
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Error Rate
Seg1 11363 6562 7201 3891 0.6084 = 17,654 / 29,017
Seg2 4115 28389 7269 14063 0.4727 = 25,447 / 53,836
Seg3 1736 3057 34845 4230 0.2057 = 9,023 / 43,868
Seg4 2754 10222 6587 30711 0.3891 = 19,563 / 50,274
Totals 19968 48230 55902 52895 0.4050 = 71,687 / 176,995
Testing Sample
Similarly, the Naive Bayes model error rate for the testing sample is the highest we have seen thus far for testing samples, at 40.64%.
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Error Rate
Seg1 4825 2830 3184 1713 0.6156 = 7,727 / 12,552
Seg2 1764 12082 3105 5987 0.4733 = 10,856 / 22,938
Seg3 752 1230 14855 1890 0.2068 = 3,872 / 18,727
Seg4 1126 4427 2770 13191 0.3869 = 8,323 / 21,514
Totals 8467 20569 23914 22781 0.4064 = 30,778 / 75,731
The hit ratio for the testing sample is 59.36%, which is very poor.
$train
Top-4 Hit Ratios:
k hit_ratio
1 1 0.594977
2 2 0.847035
3 3 0.959733
4 4 1.000000
$valid
Top-4 Hit Ratios:
k hit_ratio
1 1 0.593588
2 2 0.846021
3 3 0.958511
4 4 1.000000
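The laplace = 3 argument in the h2o.naiveBayes() call above adds a pseudo-count to every predictor level, so a level never observed within a segment still receives a nonzero conditional probability. A minimal sketch of the idea in Python, with purely illustrative counts (the age-band labels here are hypothetical, not taken from the model):

```python
# Laplace smoothing for one categorical Naive Bayes predictor:
#   P(level | segment) = (count + laplace) / (total + laplace * n_levels)
def smoothed_probs(counts, laplace):
    total = sum(counts.values())
    k = len(counts)
    return {level: (c + laplace) / (total + laplace * k)
            for level, c in counts.items()}

# Hypothetical within-segment counts for an age-band predictor.
counts = {"0-25": 0, "26-50": 40, "51-75": 10, "76+": 0}
probs = smoothed_probs(counts, laplace=3)

print(probs)  # unseen levels get 3/62 rather than 0; P(26-50) = 43/62
```

Without smoothing, a single zero count would zero out the whole product of conditional probabilities for that segment.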
6.16.2 Comparing the Five Predictive Models’ Diagnostic Statistics
covH[5, 1] <- "naive_Bayes"
covH[5, 2] <- fs3Cov.class_nB@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1] #
covH[5, 3] <- fs3Cov.class_nB@model$validation_metrics@metrics$MSE #
covH[5, 4] <- fs3Cov.class_nB@model$validation_metrics@metrics$RMSE #
covH[5, 5] <- fs3Cov.class_nB@model$validation_metrics@metrics$logloss
covH[5, 6] <- fs3Cov.class_nB@model$validation_metrics@metrics$mean_per_class_error
The Random Forest model is slightly superior to the other models overall; however, its hit ratio is nothing to be proud of. The Naive Bayes model is clearly the worst of the five.
pacman::p_load(sjPlot)
tab_df(covH, sort.column = -2, show.rownames = FALSE, digits = 3,
title = "Statistics of five Level 2 models of the 4-segment solution")
| Prediction_model | hit_ratio | MSE | RMSE | logloss | mean_per_class_error |
|---|---|---|---|---|---|
| GBM_Boosting | 0.637 | 0.301 | 0.549 | 0.851 | 0.378 |
| Random_forest | 0.636 | 0.296 | 0.544 | 0.845 | 0.378 |
| GLM | 0.625 | 0.307 | 0.554 | 0.873 | 0.389 |
| Deep_Learning | 0.624 | 0.296 | 0.544 | 0.864 | 0.391 |
| naive_Bayes | 0.594 | 0.328 | 0.572 | 1.024 | 0.421 |
6.16.3 Graphically Comparing The Hit Ratios
As you can see, the hit ratios of the five models are all quite similar, and all quite poor.
pacman::p_load(reshape2)
# covH[,1:2]
hits_long<- melt(covH[,1:2] ) # need to reshape to 'long' form
# head(hits_long )
pacman::p_load(ggplot2)
ggplot(data=hits_long, aes(x= reorder(Prediction_model, -value), y=value ) ) +
geom_bar( stat="identity" , fill = "pink", width = 0.3) +
geom_point(aes( color= Prediction_model ), size=3) +
theme(axis.text.x = element_text(angle = 0, hjust = 1, size=10)) +
labs( title= "Hit ratios for predictive models using covariates" ,
y= "Proportion correct", x= "Prediction Model") +
ylim(0, 1)
6.16.4 Obtaining the segment assignments using the random forest model
Here we are predicting each respondent’s segment based on the fs3Cov.class_rf model (demographic variables only).
6.17 Combining L1 segments and L2 segments
The segments developed from the 4-segment k-means solution and those predicted based on the demographic variables are combined into a single dataframe below and then saved.
fs4.L1.L2.segs <- h2o.cbind(fs4C.class[,c(1,2)], cov.assign.hex$predict, fs4C.class[,3:34])
#fs4.L1.L2.segs
colnames(fs4.L1.L2.segs)[ c(2,3)] <- c("L1_Segments", "L2_Segments")
fs4.L1.L2.segs[1:6, 1:5]
  Confidence L1_Segments L2_Segments Tech.Enthu. Ann_Prem_Hm_Insu
1 0.009426962 Seg2 Seg2 0.009739573 0.004403856
2 0.025949382 Seg2 Seg2 0.014750739 0.005865521
3 0.025238024 Seg2 Seg2 0.029873820 0.020744546
4 0.031883370 Seg2 Seg2 0.024817610 0.018098652
5 0.030433902 Seg2 Seg2 0.028479670 0.009048419
6 0.056125319 Seg2 Seg2 0.061331941 0.063795120
[6 rows x 5 columns]
This table shows the coincidence and divergence of prediction using the two models.
L1_Segments L2_Segments Counts
1 Seg1 Seg1 20923
2 Seg1 Seg2 8626
3 Seg1 Seg3 6702
4 Seg1 Seg4 5318
5 Seg2 Seg1 3995
6 Seg2 Seg2 51561
7 Seg2 Seg3 5551
8 Seg2 Seg4 15667
9 Seg3 Seg1 2383
10 Seg3 Seg2 3846
11 Seg3 Seg3 51096
12 Seg3 Seg4 5270
13 Seg4 Seg1 2575
14 Seg4 Seg2 16415
15 Seg4 Seg3 6171
16 Seg4 Seg4 46627
[16 rows x 3 columns]
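From the coincidence table, the overall agreement between the k-means (L1) assignments and the random forest (L2) predictions is just the diagonal share, which can be checked quickly in Python:

```python
# Agreement between L1 (k-means) and L2 (predicted) segment assignments,
# recomputed from the coincidence table above.
pairs = {
    ("Seg1", "Seg1"): 20923, ("Seg1", "Seg2"): 8626,
    ("Seg1", "Seg3"): 6702,  ("Seg1", "Seg4"): 5318,
    ("Seg2", "Seg1"): 3995,  ("Seg2", "Seg2"): 51561,
    ("Seg2", "Seg3"): 5551,  ("Seg2", "Seg4"): 15667,
    ("Seg3", "Seg1"): 2383,  ("Seg3", "Seg2"): 3846,
    ("Seg3", "Seg3"): 51096, ("Seg3", "Seg4"): 5270,
    ("Seg4", "Seg1"): 2575,  ("Seg4", "Seg2"): 16415,
    ("Seg4", "Seg3"): 6171,  ("Seg4", "Seg4"): 46627,
}
total = sum(pairs.values())
agree = sum(n for (l1, l2), n in pairs.items() if l1 == l2)
print(f"{agree} / {total} = {agree / total:.4f}")  # → 170207 / 252726 = 0.6735
```

An agreement rate of about 67% on the full dataset is consistent with the roughly 64% holdout hit ratio reported for the L2 models.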
6.18 Visualizing & Summarizing customers in segments
6.18.1 Table of Attribute Attitudes and Segments
The table below lists the means for each attitudinal variable for those in each of the 4 Level 1 Segments.
Top 3 most important variables in each segment:
Segment 1 Top 3:
- Trust Financial/Banking Applications
- Technology Enthusiast Per Household
- Consent in sharing Mobile Information per HouseHold
Segment 2 Top 3:
- Trust Financial/Banking Applications
- Annual Premiums Paid for Home Insurance per HouseHold
- Technology Enthusiast People Per Household
Segment 3 Top 3:
- Trust Financial/Banking Applications
- Technology Enthusiast Per Household
- Smart Thermostat Owners per HouseHold & Confidence in Big Business per HouseHold
Segment 4 Top 3:
- Trust Financial/Banking Applications
- Annual Premiums Paid for Home Insurance Per HouseHold
- Smart Home Assistant Owners per HouseHold
library("dplyr")
library("kableExtra")
Customer_Data <- read.csv("C:/Jaspreet Marketing Analytics Project/FS.4Seg.L1.L2_16dec20.csv")
library(psych)
library(data.table)
Customer_Data %>%
mutate(Group = as.factor(L1_Segments)) %>%
group_by(Group) %>%
summarize(Avg_er = round(mean(Confidence),2),
Avg_ps = round(mean(Tech.Enthu.),2),
Avg_fo = round(mean(Ann_Prem_Hm_Insu),2),
Avg_tp = round(mean(Mob_Mark_cons),2),
Avg_sc = round(mean(Mob_Inf_sec_cons),2),
Avg_sr = round(mean(X..Smart_purch),2),
Avg_os = round(mean(X.Smt_Therm),2),
Avg_ca = round(mean(X.Smt_Hm_Secu),2),
Avg_c = round(mean(X.Smt_Hm_Ass),2),
Avg_f = round(mean(Smt_Hm_Dev),2),
Avg_h = round(mean(X.Smat_Therm.Pln),2),
Avg_i = round(mean(X.Smt_Hm_Sec.Pln),2),
Avg_j = round(mean(X.Smt_Hm_Asst.Pln),2),
Avg_k = round(mean(X.Smt_Hm_Dev.Pln),2),
Avg_l = round(mean(Fin_App_Trst),2),
Count_of_Members = n()
) %>%
arrange(Group) %>%
transpose() -> cd
colnames(cd) <- cd[1,]
cd <- cd[-1,]
cd$order <- 1:nrow(cd)
cd$order
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
rownames(cd)[1:15] <- colnames(Customer_Data[c(1,4:17)])
rownames(cd)[16] <- c("Segment_Size")
# cd$variable <- rownames(cd)
cd$variable <- c("Confidence in Big Business per HouseHold- RBC",
"Technology Enthusiast People per HouseHold",
"Annual Premiums Paid for Home Insurance per HouseHold",
"Consent in Mobile Marketing per HouseHold",
"Consent in sharing Mobile Information per HouseHold",
"Per HouseHold people purchasing through smartphones",
"Smart Thermostat Owners per HouseHold",
"Smart Home Security Owners per HouseHold",
"Smart Home Assistant Owners per HouseHold",
"Smart Home Devices Owners per HouseHold",
"People planning to buy Smart Thermostat per HouseHold",
"People planning to buy Smart Home Security devices per HouseHold",
"People planning to buy Smart Home Assistant per HouseHold",
"People planning to buy Smart Home Devices per HouseHold",
"People having Financial Application Trust per HouseHold",
"Segment Size")
cd <- cd[, c(6, 1:5)]
cd[,2:5] <- lapply(cd[,2:5], function(x) as.numeric(as.character(x)))
cd[1:15, 1:5] %>%
arrange(variable) %>%
mutate_if(is.numeric, function(x) {
cell_spec(x, bold = T,
color = spec_color(x, end = 0.9),
font_size = spec_font_size(x))
}) %>%
kable(escape = F, align = "c") %>%
kable_styling(c("striped", "condensed"), full_width = T, position = "left")
| variable | Seg1 | Seg2 | Seg3 | Seg4 |
|---|---|---|---|---|
| Annual Premiums Paid for Home Insurance per HouseHold | 0.16 | 0.22 | 0.18 | 0.29 |
| Confidence in Big Business per HouseHold- RBC | 0.2 | 0.2 | 0.27 | 0.23 |
| Consent in Mobile Marketing per HouseHold | 0.14 | 0.12 | 0.15 | 0.13 |
| Consent in sharing Mobile Information per HouseHold | 0.21 | 0.16 | 0.19 | 0.19 |
| People having Financial Application Trust per HouseHold | 0.46 | 0.39 | 0.43 | 0.47 |
| People planning to buy Smart Home Assistant per HouseHold | 0.08 | 0.06 | 0.11 | 0.08 |
| People planning to buy Smart Home Devices per HouseHold | 0.1 | 0.08 | 0.13 | 0.1 |
| People planning to buy Smart Home Security devices per HouseHold | 0.1 | 0.09 | 0.16 | 0.11 |
| People planning to buy Smart Thermostat per HouseHold | 0.09 | 0.08 | 0.11 | 0.1 |
| Per HouseHold people purchasing through smartphones | 0.19 | 0.11 | 0.18 | 0.13 |
| Smart Home Assistant Owners per HouseHold | 0.19 | 0.17 | 0.26 | 0.21 |
| Smart Home Devices Owners per HouseHold | 0.11 | 0.09 | 0.16 | 0.12 |
| Smart Home Security Owners per HouseHold | 0.08 | 0.06 | 0.13 | 0.07 |
| Smart Thermostat Owners per HouseHold | 0.15 | 0.15 | 0.27 | 0.19 |
| Technology Enthusiast People per HouseHold | 0.26 | 0.18 | 0.25 | 0.21 |
Although the L2 segmentation wasn’t that impressive, we went ahead and looked at the segments at further granularity to see if we could pick up anything interesting that could add value.
6.18.2 Correspondence Analysis of Attributes and Segments
Seg1 Seg2 Seg3 Seg4
Confidence 0.20 0.20 0.27 0.23
Tech.Enthu. 0.26 0.18 0.25 0.21
Ann_Prem_Hm_Insu 0.16 0.22 0.18 0.29
Mob_Mark_cons 0.14 0.12 0.15 0.13
Mob_Inf_sec_cons 0.21 0.16 0.19 0.19
X..Smart_purch 0.19 0.11 0.18 0.13
X.Smt_Therm 0.15 0.15 0.27 0.19
X.Smt_Hm_Secu 0.08 0.06 0.13 0.07
X.Smt_Hm_Ass 0.19 0.17 0.26 0.21
Smt_Hm_Dev 0.11 0.09 0.16 0.12
X.Smat_Therm.Pln 0.09 0.08 0.11 0.10
X.Smt_Hm_Sec.Pln 0.10 0.09 0.16 0.11
X.Smt_Hm_Asst.Pln 0.08 0.06 0.11 0.08
X.Smt_Hm_Dev.Pln 0.10 0.08 0.13 0.10
Fin_App_Trst 0.46 0.39 0.43 0.47
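For intuition, correspondence analysis amounts to a singular value decomposition of the table's standardized residuals. The toy sketch below (Python/NumPy, using only the first three rows of the table above rather than the full cd.m matrix) mirrors what the CA() call computes; the squared singular values are the inertia explained by each dimension:

```python
import numpy as np

# Toy correspondence analysis: SVD of the standardized residuals of a
# contingency-style table. Rows here are the first three attributes above
# (Confidence, Tech.Enthu., Ann_Prem_Hm_Insu); columns are Seg1..Seg4.
X = np.array([[0.20, 0.20, 0.27, 0.23],
              [0.26, 0.18, 0.25, 0.21],
              [0.16, 0.22, 0.18, 0.29]])

P = X / X.sum()                      # correspondence matrix
r = P.sum(axis=1, keepdims=True)     # row masses
c = P.sum(axis=0, keepdims=True)     # column masses
S = (P - r @ c) / np.sqrt(r @ c)     # standardized residuals
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

row_coords = (U * sv) / np.sqrt(r)   # principal coordinates plotted by CA
col_coords = (Vt.T * sv) / np.sqrt(c.T)
print("inertia per dimension:", np.round(sv ** 2, 6))
```

The last singular value is numerically zero (the trivial dimension), so a table with three rows yields at most two plottable dimensions.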
c <- CA(cd.m, graph=FALSE)
plot(c, title="Correspondence Analysis of Attributes and Segments", col.main="blue")
6.18.3 Interpreting the Segments Using Demographic Data
The several tables that follow provide the row and column percents for each level of each demographic for the 4 Level 1 segments. While prediction using the demographic variables has been shown to be very poor, there may be some insights that could help to better describe the segments.
The Chi-square statistics beneath each table provide some indication of whether the table shows any significant relationship between the variables and the segments.
The blue/purple numbers represent the across-segment (row) percentages; the green numbers represent the within-segment (column) percentages.
6.18.4 Segments by Total Population Age
- Segment 1: 88.1% of them are within the 26-50 age bracket.
- Segment 2: 81.4% of them are within the 26-50 age bracket.
- Segment 3: 88.3% of them are within the 26-50 age bracket.
- Segment 4: Skewing older…only 67.9% are 26-50, while 31.4% are 51-75.
library(sjPlot)
sjt.xtab(Customer_Data$Pop_Age, Customer_Data$L1_Segments,
show.row.prc = TRUE, show.col.prc = TRUE)
| Pop_Age | L1_Segments | Total | | | |
|---|---|---|---|---|---|
| Seg1 | Seg2 | Seg3 | Seg4 | ||
| 0-25 Years |
864 18.4 % 2.1 % |
2827 60.2 % 3.7 % |
773 16.5 % 1.2 % |
234 5 % 0.3 % |
4698 100 % 1.9 % |
| 26-50 Years |
36608 18 % 88.1 % |
62481 30.8 % 81.4 % |
55280 27.2 % 88.3 % |
48713 24 % 67.9 % |
203082 100 % 80.4 % |
| 51-75 Years |
4084 9.2 % 9.8 % |
11165 25.2 % 14.5 % |
6471 14.6 % 10.3 % |
22573 51 % 31.4 % |
44293 100 % 17.5 % |
| More than 76 Years |
13 2 % 0 % |
301 46.1 % 0.4 % |
71 10.9 % 0.1 % |
268 41 % 0.4 % |
653 100 % 0.3 % |
| Total |
41569 16.4 % 100 % |
76774 30.4 % 100 % |
62595 24.8 % 100 % |
71788 28.4 % 100 % |
252726 100 % 100 % |
χ2=16429.159 · df=9 · Cramer’s V=0.147 · p=0.000 |
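The Cramér's V printed beneath each crosstab normalizes the chi-square statistic by the sample size and table dimensions: V = sqrt(chi2 / (n * (min(rows, cols) - 1))). A quick check in Python, using the values reported under the Pop_Age table above:

```python
# Recompute Cramer's V from the reported chi-square for the Pop_Age table.
from math import sqrt

chi2 = 16429.159        # reported chi-square statistic
n = 252726              # total count in the table
min_dim = min(4, 4)     # 4 age bands x 4 segments

V = sqrt(chi2 / (n * (min_dim - 1)))
print(round(V, 3))      # → 0.147, matching the reported value
```

Values this small (0.05-0.26 across the tables that follow) confirm that the demographic associations with the segments are statistically significant but weak.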
6.18.5 Segments by Male Population Age
sjt.xtab(Customer_Data$M_Pop_Age, Customer_Data$L1_Segments,
show.row.prc = TRUE, show.col.prc = TRUE)
| M_Pop_Age | L1_Segments | Total | | | |
|---|---|---|---|---|---|
| Seg1 | Seg2 | Seg3 | Seg4 | ||
| 0-25 Years |
1472 13.7 % 3.5 % |
5272 49 % 6.9 % |
1803 16.8 % 2.9 % |
2207 20.5 % 3.1 % |
10754 100 % 4.3 % |
| 26-50 Years |
36231 18.2 % 87.2 % |
59873 30 % 78 % |
54143 27.1 % 86.5 % |
49201 24.7 % 68.5 % |
199448 100 % 78.9 % |
| 51-75 Years |
3837 9.2 % 9.2 % |
11346 27.2 % 14.8 % |
6543 15.7 % 10.5 % |
20037 48 % 27.9 % |
41763 100 % 16.5 % |
| More than 76 Years |
29 3.8 % 0.1 % |
283 37.2 % 0.4 % |
106 13.9 % 0.2 % |
343 45.1 % 0.5 % |
761 100 % 0.3 % |
| Total |
41569 16.4 % 100 % |
76774 30.4 % 100 % |
62595 24.8 % 100 % |
71788 28.4 % 100 % |
252726 100 % 100 % |
χ2=12301.018 · df=9 · Cramer’s V=0.127 · p=0.000 |
6.18.6 Segments by Female Population Age
#__THE LEVELS WILL NOT PRINT IN ORDER UNLESS WE PHYSICALLY ORDER THEM
library(forcats)
Customer_Data$F_Pop_Age <- fct_relevel(Customer_Data$F_Pop_Age)
levels(Customer_Data$F_Pop_Age)
[1] "0-25 Years" "26-50 Years" "51-75 Years"
[4] "More than 76 Years"
sjt.xtab(Customer_Data$F_Pop_Age, Customer_Data$L1_Segments,
show.row.prc = TRUE, show.col.prc = TRUE)
| F_Pop_Age | L1_Segments | Total | | | |
|---|---|---|---|---|---|
| Seg1 | Seg2 | Seg3 | Seg4 | ||
| 0-25 Years |
5284 17.5 % 12.7 % |
10925 36.2 % 14.2 % |
5898 19.6 % 9.4 % |
8057 26.7 % 11.2 % |
30164 100 % 11.9 % |
| 26-50 Years |
29364 18 % 70.6 % |
48605 29.8 % 63.3 % |
47324 29 % 75.6 % |
37741 23.1 % 52.6 % |
163034 100 % 64.5 % |
| 51-75 Years |
6199 11.5 % 14.9 % |
15377 28.4 % 20 % |
8427 15.6 % 13.5 % |
24110 44.6 % 33.6 % |
54113 100 % 21.4 % |
| More than 76 Years |
722 13.3 % 1.7 % |
1867 34.5 % 2.4 % |
946 17.5 % 1.5 % |
1880 34.7 % 2.6 % |
5415 100 % 2.1 % |
| Total |
41569 16.4 % 100 % |
76774 30.4 % 100 % |
62595 24.8 % 100 % |
71788 28.4 % 100 % |
252726 100 % 100 % |
χ2=11718.907 · df=9 · Cramer’s V=0.124 · p=0.000 |
6.18.7 Segments by Private Household
Overall, it appears that the majority of people in every segment live in 3-4 person households.
- Segment 1: 60% of this segment lives in 3-4 person households.
- Segment 2: 80% of this segment lives in 3-4 person households.
- Segment 3: 72.3% of this segment lives in 3-4 person households.
- Segment 4: 73.1% of this segment lives in 3-4 person households. Roughly half of the 0-2 person households who were surveyed fall into Segment 4.
#__THE LEVELS WILL NOT PRINT IN ORDER UNLESS WE PHYSICALLY ORDER THEM
library(forcats)
Customer_Data$Pvt_HHld <- fct_relevel(Customer_Data$Pvt_HHld)
levels(Customer_Data$Pvt_HHld)
[1] "0-2 Persons Private Households" "3-4 Persons Private Households"
[3] "5-6 Persons Private Households" "7-8 Persons Private Households"
[5] "8+ Persons Private Households"
sjt.xtab(Customer_Data$Pvt_HHld, Customer_Data$L1_Segments,
show.row.prc = TRUE, show.col.prc = TRUE)
| Pvt_HHld | L1_Segments | Total | | | |
|---|---|---|---|---|---|
| Seg1 | Seg2 | Seg3 | Seg4 | ||
|
0-2 Persons Private Households |
10220 26.1 % 24.6 % |
9157 23.4 % 11.9 % |
542 1.4 % 0.9 % |
19257 49.2 % 26.8 % |
39176 100 % 15.5 % |
|
3-4 Persons Private Households |
24947 13.5 % 60 % |
61598 33.4 % 80.2 % |
45277 24.6 % 72.3 % |
52490 28.5 % 73.1 % |
184312 100 % 72.9 % |
|
5-6 Persons Private Households |
4461 20.6 % 10.7 % |
4698 21.7 % 6.1 % |
12416 57.4 % 19.8 % |
41 0.2 % 0.1 % |
21616 100 % 8.6 % |
|
7-8 Persons Private Households |
1079 25.5 % 2.6 % |
1039 24.6 % 1.4 % |
2113 49.9 % 3.4 % |
0 0 % 0 % |
4231 100 % 1.7 % |
|
8+ Persons Private Households |
862 25.4 % 2.1 % |
282 8.3 % 0.4 % |
2247 66.3 % 3.6 % |
0 0 % 0 % |
3391 100 % 1.3 % |
| Total |
41569 16.4 % 100 % |
76774 30.4 % 100 % |
62595 24.8 % 100 % |
71788 28.4 % 100 % |
252726 100 % 100 % |
χ2=41657.517 · df=12 · Cramer’s V=0.234 · p=0.000 |
6.18.8 Segments by Family Household
Unfortunately, this variable did not yield results that can be interpreted helpfully: it is reported on a per-postal-code basis, and dividing by the total population produced values below 1.
#__THE LEVELS WILL NOT PRINT IN ORDER UNLESS WE PHYSICALLY ORDER THEM
library(forcats)
Customer_Data$Fam_Hhld <- fct_relevel(Customer_Data$Fam_Hhld)
levels(Customer_Data$Fam_Hhld)
[1] "0-5 Family Households" "11-15 Family Households"
[3] "16-20 Family Households" "21-25 Family Households"
[5] "26-30 Family Households" "31-35 Family Households"
[7] "36-40 Family Households" "40+ Family Households"
[9] "6-10 Family Households"
sjt.xtab(Customer_Data$Fam_Hhld, Customer_Data$L1_Segments,
show.row.prc = TRUE, show.col.prc = TRUE)
| Fam_Hhld | L1_Segments | Total | | | |
|---|---|---|---|---|---|
| Seg1 | Seg2 | Seg3 | Seg4 | ||
|
0-5 Family Households |
19649 19.9 % 47.3 % |
30264 30.7 % 39.4 % |
19895 20.2 % 31.8 % |
28836 29.2 % 40.2 % |
98644 100 % 39 % |
|
11-15 Family Households |
4200 13.9 % 10.1 % |
9030 29.8 % 11.8 % |
7997 26.4 % 12.8 % |
9031 29.8 % 12.6 % |
30258 100 % 12 % |
|
16-20 Family Households |
2418 15.7 % 5.8 % |
4525 29.4 % 5.9 % |
4194 27.2 % 6.7 % |
4272 27.7 % 6 % |
15409 100 % 6.1 % |
|
21-25 Family Households |
1559 13 % 3.8 % |
3496 29.2 % 4.6 % |
3766 31.5 % 6 % |
3132 26.2 % 4.4 % |
11953 100 % 4.7 % |
|
26-30 Family Households |
1127 10.3 % 2.7 % |
2990 27.3 % 3.9 % |
3981 36.4 % 6.4 % |
2848 26 % 4 % |
10946 100 % 4.3 % |
|
31-35 Family Households |
764 8 % 1.8 % |
2101 22.1 % 2.7 % |
4705 49.4 % 7.5 % |
1950 20.5 % 2.7 % |
9520 100 % 3.8 % |
|
36-40 Family Households |
543 10.2 % 1.3 % |
941 17.7 % 1.2 % |
3131 58.8 % 5 % |
707 13.3 % 1 % |
5322 100 % 2.1 % |
|
40+ Family Households |
2239 21 % 5.4 % |
4235 39.7 % 5.5 % |
1809 16.9 % 2.9 % |
2395 22.4 % 3.3 % |
10678 100 % 4.2 % |
|
6-10 Family Households |
9070 15.1 % 21.8 % |
19192 32 % 25 % |
13117 21.9 % 21 % |
18617 31 % 25.9 % |
59996 100 % 23.7 % |
| Total |
41569 16.4 % 100 % |
76774 30.4 % 100 % |
62595 24.8 % 100 % |
71788 28.4 % 100 % |
252726 100 % 100 % |
χ2=10856.938 · df=24 · Cramer’s V=0.120 · p=0.000 |
6.18.9 Segments by Non-Family Household
Unfortunately, this variable did not yield results that can be interpreted helpfully either: it is reported on a per-postal-code basis, and dividing by the total population produced values below 1.
sjt.xtab(Customer_Data$NF_Hhld, Customer_Data$L1_Segments,
show.row.prc = TRUE, show.col.prc = TRUE)| NF_Hhld | L1_Segments | Total | |||
|---|---|---|---|---|---|
| Seg1 | Seg2 | Seg3 | Seg4 | ||
|
0-1 Non-Family Households |
11028 12.3 % 26.5 % |
24734 27.6 % 32.2 % |
31633 35.3 % 50.5 % |
22204 24.8 % 30.9 % |
89599 100 % 35.5 % |
|
2-3 Non-Family Households |
8666 11.6 % 20.8 % |
22434 30 % 29.2 % |
19544 26.2 % 31.2 % |
24030 32.2 % 33.5 % |
74674 100 % 29.5 % |
|
4-5 Non-Family Households |
5804 16.3 % 14 % |
11388 31.9 % 14.8 % |
6783 19 % 10.8 % |
11741 32.9 % 16.4 % |
35716 100 % 14.1 % |
|
6-7 Non-Family Households |
3866 22.3 % 9.3 % |
5621 32.5 % 7.3 % |
2348 13.6 % 3.8 % |
5473 31.6 % 7.6 % |
17308 100 % 6.8 % |
|
9+ Non-Family Households |
12205 34.4 % 29.4 % |
12597 35.6 % 16.4 % |
2287 6.5 % 3.7 % |
8340 23.5 % 11.6 % |
35429 100 % 14 % |
| Total |
41569 16.4 % 100 % |
76774 30.4 % 100 % |
62595 24.8 % 100 % |
71788 28.4 % 100 % |
252726 100 % 100 % |
χ2=21567.022 · df=12 · Cramer’s V=0.169 · p=0.000 |
6.18.10 Segments by Average Child per census family household
Overall, roughly half (49.9%) of households across all segments fall in the 2-children category.
- Segment 1: 93.9% of this segment has 0-2 children per household.
- Segment 3: 75.2% of this segment has 2 children per household.
- Segment 4: 70.7% of this segment has 0-1 children per household.
sjt.xtab(Customer_Data$Avg_Chld_PrCen_Fm_Hhld, Customer_Data$L1_Segments,
show.row.prc = TRUE, show.col.prc = TRUE)| Avg_Chld_PrCen_Fm_Hhld | L1_Segments | Total | |||
|---|---|---|---|---|---|
| Seg1 | Seg2 | Seg3 | Seg4 | ||
|
0-1 Children Per Household |
20601 18.1 % 49.6 % |
31119 27.4 % 40.5 % |
11154 9.8 % 17.8 % |
50720 44.7 % 70.7 % |
113594 100 % 44.9 % |
|
2 Children Per Household |
18401 14.6 % 44.3 % |
39694 31.5 % 51.7 % |
47077 37.3 % 75.2 % |
20897 16.6 % 29.1 % |
126069 100 % 49.9 % |
|
3 Children Per Household |
2217 19.9 % 5.3 % |
5091 45.8 % 6.6 % |
3644 32.8 % 5.8 % |
166 1.5 % 0.2 % |
11118 100 % 4.4 % |
|
4 Children Per Household |
304 17.8 % 0.7 % |
827 48.4 % 1.1 % |
571 33.5 % 0.9 % |
5 0.3 % 0 % |
1707 100 % 0.7 % |
|
5 or More Children Per Household |
46 19.3 % 0.1 % |
43 18.1 % 0.1 % |
149 62.6 % 0.2 % |
0 0 % 0 % |
238 100 % 0.1 % |
| Total |
41569 16.4 % 100 % |
76774 30.4 % 100 % |
62595 24.8 % 100 % |
71788 28.4 % 100 % |
252726 100 % 100 % |
χ2=40909.129 · df=12 · Cramer’s V=0.232 · p=0.000 |
6.18.11 Segments by Houses
- Segment 1: Lowest likelihood of owning a house.
- Segment 2: Second highest likelihood of owning a house.
- Segment 4: Has the highest likelihood of owning a house.
library(forcats)
Customer_Data$Houses <- fct_relevel(as.factor(Customer_Data$Houses))
# levels(Customer_Data$Income)
sjt.xtab(Customer_Data$Houses, Customer_Data$L1_Segments,
show.row.prc = TRUE, show.col.prc = TRUE)
| Houses | L1_Segments | Total | | | |
|---|---|---|---|---|---|
| Seg1 | Seg2 | Seg3 | Seg4 | ||
|
0-25% Living in Houses |
29500 31.8 % 71 % |
22999 24.8 % 30 % |
31883 34.4 % 50.9 % |
8303 9 % 11.6 % |
92685 100 % 36.7 % |
|
26-50% Living in Houses |
12007 7.7 % 28.9 % |
53300 34.2 % 69.4 % |
30708 19.7 % 49.1 % |
59946 38.4 % 83.5 % |
155961 100 % 61.7 % |
|
51-75% Living in Houses |
60 1.6 % 0.1 % |
468 12.6 % 0.6 % |
4 0.1 % 0 % |
3191 85.7 % 4.4 % |
3723 100 % 1.5 % |
|
More than 76% Living in Houses |
2 0.6 % 0 % |
7 2 % 0 % |
0 0 % 0 % |
348 97.5 % 0.5 % |
357 100 % 0.1 % |
| Total |
41569 16.4 % 100 % |
76774 30.4 % 100 % |
62595 24.8 % 100 % |
71788 28.4 % 100 % |
252726 100 % 100 % |
χ2=52175.616 · df=9 · Cramer’s V=0.262 · p=0.000 |
6.18.12 Segments by Apartments
- Segment 1: Most likely to rent an apartment.
- Segment 2: Extremely unlikely to rent an apartment.
- Segment 3: Extremely unlikely to rent an apartment.
- Segment 4: Extremely unlikely to rent an apartment.
library(forcats)
Customer_Data$Apt <- fct_relevel(as.factor(Customer_Data$Apt))
# levels(Customer_Data$Income)
sjt.xtab(Customer_Data$Apt, Customer_Data$L1_Segments,
show.row.prc = TRUE, show.col.prc = TRUE)
| Apt | L1_Segments | Total | | | |
|---|---|---|---|---|---|
| Seg1 | Seg2 | Seg3 | Seg4 | ||
|
0-25% Living in Apartments |
28628 12.5 % 68.9 % |
71037 31.1 % 92.5 % |
61947 27.1 % 99 % |
66670 29.2 % 92.9 % |
228282 100 % 90.3 % |
|
26-50% Living in Apartments |
9468 56.9 % 22.8 % |
4049 24.3 % 5.3 % |
325 2 % 0.5 % |
2797 16.8 % 3.9 % |
16639 100 % 6.6 % |
|
51-75% Living in Apartments |
3170 47.4 % 7.6 % |
1527 22.8 % 2 % |
261 3.9 % 0.4 % |
1736 25.9 % 2.4 % |
6694 100 % 2.6 % |
|
More than 76% Living in Apartments |
303 27.3 % 0.7 % |
161 14.5 % 0.2 % |
62 5.6 % 0.1 % |
585 52.7 % 0.8 % |
1111 100 % 0.4 % |
| Total |
41569 16.4 % 100 % |
76774 30.4 % 100 % |
62595 24.8 % 100 % |
71788 28.4 % 100 % |
252726 100 % 100 % |
χ2=29540.281 · df=9 · Cramer’s V=0.197 · p=0.000 |
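The 0-25% / 26-50% / 51-75% / more-than-76% row labels used throughout these crosstabs come from binning each postal code's percentage into four bands. A Python sketch of that binning step (the exact boundary rule, inclusive upper bounds here, is an assumption about how the bands were cut):

```python
def pct_band(pct: float) -> str:
    """Map a percentage (0-100) to the four bands used in the crosstabs.

    Band labels follow the report; the inclusive upper bounds are assumed.
    """
    if pct <= 25:
        return "0-25%"
    if pct <= 50:
        return "26-50%"
    if pct <= 75:
        return "51-75%"
    return "More than 76%"

# e.g. a postal code where 62% of dwellings are apartments:
band = pct_band(62)  # "51-75%"
```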
6.18.13 Segments by Condos
- All segments: Very unlikely to own or rent a condo; the association with segment is weak (Cramér's V = 0.049).
library(forcats)
Customer_Data$Condos <- fct_relevel(as.factor(Customer_Data$Condos))
sjt.xtab(Customer_Data$Condos, Customer_Data$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)

| Condos | Seg1 | Seg2 | Seg3 | Seg4 | Total |
|---|---|---|---|---|---|
| 0-25% Living in Condos | 40549 (16.4%, 97.5%) | 75443 (30.5%, 98.3%) | 61905 (25.1%, 98.9%) | 69108 (28%, 96.3%) | 247005 (100%, 97.7%) |
| 26-50% Living in Condos | 966 (21.5%, 2.3%) | 1263 (28.2%, 1.6%) | 418 (9.3%, 0.7%) | 1837 (41%, 2.6%) | 4484 (100%, 1.8%) |
| 51-75% Living in Condos | 54 (4.8%, 0.1%) | 68 (6.1%, 0.1%) | 246 (22.1%, 0.4%) | 747 (67%, 1%) | 1115 (100%, 0.4%) |
| More than 76% Living in Condos | 0 (0%, 0%) | 0 (0%, 0%) | 26 (21.3%, 0%) | 96 (78.7%, 0.1%) | 122 (100%, 0%) |
| Total | 41569 (16.4%, 100%) | 76774 (30.4%, 100%) | 62595 (24.8%, 100%) | 71788 (28.4%, 100%) | 252726 (100%, 100%) |

Cells: n (row %, column %). χ² = 1847.712, df = 9, Cramér's V = 0.049, p < 0.001
6.18.14 Segments by Household Income
The household income figures reported here were normalized by the population of each postal code.
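The per-capita normalization described above can be sketched as follows. This is illustrative Python rather than the report's R code, and the field names (`population`, `hh_income_100k`) are hypothetical stand-ins for the actual DemoStats variables:

```python
# Convert raw DemoStats counts into per-capita rates so that postal codes
# of different sizes are comparable. Field names are illustrative only.
rows = [
    {"fsaldu": "A", "population": 1200, "hh_income_100k": 300},
    {"fsaldu": "B", "population": 400, "hh_income_100k": 40},
]

for r in rows:
    # Share of the postal code's population in the $100K+ income bucket.
    r["income_100k_rate"] = r["hh_income_100k"] / r["population"]
```

Without this step, large postal codes would dominate every count-based variable regardless of their actual composition.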
6.18.15 Segments by High School Certificate Or Equivalent
Most people in all segments are likely to have an education level higher than solely a high school diploma.
library(forcats)
Customer_Data$High.School <- fct_relevel(as.factor(Customer_Data$High.School))
sjt.xtab(Customer_Data$High.School, Customer_Data$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)

| High.School | Seg1 | Seg2 | Seg3 | Seg4 | Total |
|---|---|---|---|---|---|
| 0-25% completed High School | 23471 (15.5%, 56.5%) | 42464 (28.1%, 55.3%) | 45233 (29.9%, 72.3%) | 40160 (26.5%, 55.9%) | 151328 (100%, 59.9%) |
| 26-50% completed High School | 17845 (17.7%, 42.9%) | 34246 (34%, 44.6%) | 17251 (17.1%, 27.6%) | 31398 (31.2%, 43.7%) | 100740 (100%, 39.9%) |
| 51-75% completed High School | 224 (36.9%, 0.5%) | 63 (10.4%, 0.1%) | 107 (17.6%, 0.2%) | 213 (35.1%, 0.3%) | 607 (100%, 0.2%) |
| More than 76% completed High School | 29 (56.9%, 0.1%) | 1 (2%, 0%) | 4 (7.8%, 0%) | 17 (33.3%, 0%) | 51 (100%, 0%) |
| Total | 41569 (16.4%, 100%) | 76774 (30.4%, 100%) | 62595 (24.8%, 100%) | 71788 (28.4%, 100%) | 252726 (100%, 100%) |

Cells: n (row %, column %). χ² = 5643.164, df = 9, Cramér's V = 0.086, p < 0.001
6.18.16 Segments by College, CEGEP Or Other Non-University Certificate Or Diploma
Most people in all segments are likely to have an education level higher than solely a college/CEGEP diploma.
library(forcats)
Customer_Data$Clg_CEGEP_Non_Uni_Dip <- fct_relevel(as.factor(Customer_Data$Clg_CEGEP_Non_Uni_Dip))
sjt.xtab(Customer_Data$Clg_CEGEP_Non_Uni_Dip, Customer_Data$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)

| Clg_CEGEP_Non_Uni_Dip | Seg1 | Seg2 | Seg3 | Seg4 | Total |
|---|---|---|---|---|---|
| 0-25% completed Non-University Diploma | 36392 (17.2%, 87.5%) | 62219 (29.3%, 81%) | 58918 (27.8%, 94.1%) | 54637 (25.8%, 76.1%) | 212166 (100%, 84%) |
| 26-50% completed Non-University Diploma | 5166 (12.8%, 12.4%) | 14543 (35.9%, 18.9%) | 3658 (9%, 5.8%) | 17091 (42.2%, 23.8%) | 40458 (100%, 16%) |
| 51-75% completed Non-University Diploma | 11 (13.1%, 0%) | 12 (14.3%, 0%) | 17 (20.2%, 0%) | 44 (52.4%, 0.1%) | 84 (100%, 0%) |
| More than 76% completed Non-University Diploma | 0 (0%, 0%) | 0 (0%, 0%) | 2 (11.1%, 0%) | 16 (88.9%, 0%) | 18 (100%, 0%) |
| Total | 41569 (16.4%, 100%) | 76774 (30.4%, 100%) | 62595 (24.8%, 100%) | 71788 (28.4%, 100%) | 252726 (100%, 100%) |

Cells: n (row %, column %). χ² = 9010.909, df = 9, Cramér's V = 0.109, Fisher's p < 0.001
6.18.17 Segments by University Certificate Or Diploma Below Bachelor
- Segment 2: Is the least likely to have a university degree as their highest level of education.
- Segment 3: Is the most likely to have a university degree as their highest level of education. This is the most educated group.
library(forcats)
Customer_Data$Uni_Dip_Blw_Bach <- fct_relevel(as.factor(Customer_Data$Uni_Dip_Blw_Bach))
sjt.xtab(Customer_Data$Uni_Dip_Blw_Bach, Customer_Data$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)

| Uni_Dip_Blw_Bach | Seg1 | Seg2 | Seg3 | Seg4 | Total |
|---|---|---|---|---|---|
| 0-25% completed University Degree | 22845 (15.2%, 55%) | 62315 (41.4%, 81.2%) | 21317 (14.2%, 34.1%) | 44108 (29.3%, 61.4%) | 150585 (100%, 59.6%) |
| 26-50% completed University Degree | 14368 (16.4%, 34.6%) | 13575 (15.5%, 17.7%) | 35135 (40.2%, 56.1%) | 24343 (27.8%, 33.9%) | 87421 (100%, 34.6%) |
| 51-75% completed University Degree | 4268 (30.3%, 10.3%) | 852 (6.1%, 1.1%) | 5834 (41.4%, 9.3%) | 3128 (22.2%, 4.4%) | 14082 (100%, 5.6%) |
| More than 76% completed University Degree | 88 (13.8%, 0.2%) | 32 (5%, 0%) | 309 (48.4%, 0.5%) | 209 (32.8%, 0.3%) | 638 (100%, 0.3%) |
| Total | 41569 (16.4%, 100%) | 76774 (30.4%, 100%) | 62595 (24.8%, 100%) | 71788 (28.4%, 100%) | 252726 (100%, 100%) |

Cells: n (row %, column %). χ² = 34233.449, df = 9, Cramér's V = 0.212, p < 0.001
6.18.18 Segments by Employed
- Segments 3 and 4: Most likely to be employed.
library(forcats)
Customer_Data$Empl <- fct_relevel(as.factor(Customer_Data$Empl))
sjt.xtab(Customer_Data$Empl, Customer_Data$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)

| Empl | Seg1 | Seg2 | Seg3 | Seg4 | Total |
|---|---|---|---|---|---|
| 0-25% Employed | 1449 (13.3%, 3.5%) | 6079 (55.9%, 7.9%) | 786 (7.2%, 1.3%) | 2568 (23.6%, 3.6%) | 10882 (100%, 4.3%) |
| 26-50% Employed | 15319 (14.6%, 36.9%) | 39809 (38.1%, 51.9%) | 23601 (22.6%, 37.7%) | 25884 (24.7%, 36.1%) | 104613 (100%, 41.4%) |
| 51-75% Employed | 22440 (17.7%, 54%) | 30141 (23.8%, 39.3%) | 35055 (27.6%, 56%) | 39187 (30.9%, 54.6%) | 126823 (100%, 50.2%) |
| More than 76% Employed | 2361 (22.7%, 5.7%) | 745 (7.2%, 1%) | 3153 (30.3%, 5%) | 4149 (39.9%, 5.8%) | 10408 (100%, 4.1%) |
| Total | 41569 (16.4%, 100%) | 76774 (30.4%, 100%) | 62595 (24.8%, 100%) | 71788 (28.4%, 100%) | 252726 (100%, 100%) |

Cells: n (row %, column %). χ² = 12116.042, df = 9, Cramér's V = 0.126, p < 0.001
6.18.19 Segments by Unemployed
This crosstab did not yield interpretable results and was therefore not analyzed.
6.18.20 Segments by Total Immigrant
- Segment 3: A majority are likely to be immigrants.
library(forcats)
Customer_Data$Tot_Mig <- fct_relevel(as.factor(Customer_Data$Tot_Mig))
sjt.xtab(Customer_Data$Tot_Mig, Customer_Data$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)

| Tot_Mig | Seg1 | Seg2 | Seg3 | Seg4 | Total |
|---|---|---|---|---|---|
| 0-25% Immigrants | 25142 (16.3%, 60.5%) | 61611 (40%, 80.2%) | 12607 (8.2%, 20.1%) | 54480 (35.4%, 75.9%) | 153840 (100%, 60.9%) |
| 26-50% Immigrants | 12990 (19.4%, 31.2%) | 13363 (20%, 17.4%) | 25005 (37.4%, 39.9%) | 15491 (23.2%, 21.6%) | 66849 (100%, 26.5%) |
| 51-75% Immigrants | 3330 (10.7%, 8%) | 1724 (5.5%, 2.2%) | 24355 (78.2%, 38.9%) | 1750 (5.6%, 2.4%) | 31159 (100%, 12.3%) |
| More than 76% Immigrants | 107 (12.2%, 0.3%) | 76 (8.7%, 0.1%) | 628 (71.5%, 1%) | 67 (7.6%, 0.1%) | 878 (100%, 0.3%) |
| Total | 41569 (16.4%, 100%) | 76774 (30.4%, 100%) | 62595 (24.8%, 100%) | 71788 (28.4%, 100%) | 252726 (100%, 100%) |

Cells: n (row %, column %). χ² = 81727.778, df = 9, Cramér's V = 0.328, p < 0.001
6.19 Finalized Segments
Segment 1: Open-Minded Renters
These individuals (26-64) are the most tech-enthusiastic of all renters, but their smart-tech adoption is medium: they have smartphones, smart thermostats, and smart home assistants. Although they are excited about technology, they are not likely to purchase smart home technologies in the future. This could be because there is not enough incentive for them to do so (e.g., lowering insurance premiums, simplifying their lives). There is also a large opportunity to educate these individuals on the ease of use of these technologies, as they are middle-aged and live busy lives, most with one to two children in an apartment. While they don't plan on owning smart home technology, they are considered open-minded: they are the segment most willing to share their mobile information, are enthusiastic about technology in general, and trust banking/financial apps, and are therefore likely to convert. Overall, they are technologically enabled but need an extra push to purchase more smart home technology. Location: scattered all across Ontario.
Segment 2: Low-tech Homeowners
This segment falls in the 40-64 age bracket; they own homes, are slow to adopt smart home technology, and don't plan on purchasing it in the future. They live in 3-4 person households, and given that they have medium-to-high home insurance premiums, there is an opportunity to educate them on the impact that smart home technology could have on those premiums. They are not tech enthusiasts, but they currently own smartphones and smart home assistants. Interestingly, they are likely to consent to mobile information sharing. Location: Northern and Southern Ontario.
Segment 3: Migrant Technophiles
These individuals are very excited by technology. They own the most smart technologies, including smartphones, smart thermostats, smart home security systems, smart home assistants, and smart appliances and lighting. They are also likely to purchase more smart home technology in the future (specifically smart home assistants). Most have two children per household, are employed, and hold a university degree as their highest level of education (the highest amongst the four segments). Notably, they have the highest combined household income, at roughly $148,000. They are also the most confident in big businesses as well as banking/financial applications, which is a plus for RBCI. Location: Southern Ontario.
Segment 4: Relaxed Retirees
This segment is 65+ and pays high insurance premiums but is slow to adopt smart home technology. Most have a smart thermostat and a smart home assistant but don't plan on purchasing more smart home technology in the future. These individuals (or their children) need considerable education to become more technologically enabled. Considering that relaxed retirees pay the highest insurance premiums, messaging should emphasize opportunities to decrease premiums through the adoption of certain smart home technologies. They are also confident in big businesses, which is a plus for RBCI. Location: in or surrounding cities (non-rural).
7 Conclusions and Marketing Messaging Considerations
Our messaging to the four segments will depend on the key benefits derived from smart home technologies. We have narrowed these down to four major benefits:
- Increased security: Smart home technologies can reduce the risk of fire or theft through built-in alert systems. This can be especially valuable to people for whom security is the biggest driver.
- Reduced costs: Installing smart home devices may reduce insurance premiums, similar to how installing similar devices in vehicles can reduce car insurance premiums. This aspect of smart home technology will be particularly appealing to people who have low income, or are paying high homeowners insurance premiums.
- Increased convenience: An ancillary benefit of installing these technologies is the ability to monitor and control the home remotely.
- Improved quality-of-life: For people living alone, especially seniors, smart devices can literally save lives; a smart speaker might be able to bring emergency services to the side of a person who has just fallen. Improved connectivity will also help with mental health; for people who can't connect with their loved ones, especially under the current circumstances, this might hasten adoption of smart home technologies.
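The "reduced costs" benefit above can be made concrete with a small illustration, analogous to telematics discounts on car insurance. The 5% discount rate below is a hypothetical figure chosen for illustration, not an actual RBC Insurance offer:

```python
def discounted_premium(annual_premium: float, discount_rate: float = 0.05) -> float:
    """Annual premium after a flat smart-home-device discount.

    The 5% default is a hypothetical illustration, not an RBCI figure.
    """
    return annual_premium * (1 - discount_rate)

# A homeowner paying $1,400/year would save $70 under a 5% discount,
# leaving an annual premium of roughly $1,330.
new_premium = discounted_premium(1400.0)
```

Even a modest percentage discount gives a concrete dollar figure to lead with in messaging, which matters most for the high-premium segments (Low-tech Homeowners, Relaxed Retirees).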
Factors influencing the purchase of smart devices
- Peer pressure: We may be on the edge of a cultural shift similar to the one in the early 2000s, when cell phones went from being a nice-to-have to a must-have. While the immediate pull will likely be slower due to the lower visibility of these items, they may eventually become hallmarks of social progress.
- Concerns around Data Privacy: With more and more companies using and selling data, people may be hesitant to trust multi-billion dollar conglomerates with their most sensitive data, including voice and behavioural data.
In conclusion, we have identified four segments: Open-Minded Renters, Low-tech Homeowners, Migrant Technophiles, and Relaxed Retirees. All segments should be considered for targeting, and marketing considerations are elaborated upon above.
Considering the messaging and value propositions mentioned above, this is how the marketing strategy, and specifically the messaging, would be communicated to each segment:
Segment 1: Open-Minded Renters
- Marketing Messaging: Reduced costs (premiums), increased convenience
- Marketing Channel: Mobile marketing
Segment 2: Low-tech Homeowners
- Marketing Messaging: Reduced costs (premiums), increased convenience
- Marketing Channel: Traditional Marketing (radio, television, telemarketing)
Segment 3: Migrant Technophiles
- Marketing Messaging: Increased security, improved QoL
- Marketing Channel: Mobile marketing
Segment 4: Relaxed Retirees
- Marketing Messaging: Reduced costs (premiums), improved QoL
- Marketing Channel: Traditional Marketing (radio, television, telemarketing)
Some suggestions for partnerships that could result in client offerings, such as discounts on premiums, free installation, or discounts on the technology products themselves, include:
- Google Nest: Nest smart thermostats and smoke & CO detectors. Benefits: mitigates the risk of fire damage and of damage due to extreme home temperatures.
- Amazon Ring: Smart doorbells and a smart flying security drone. Benefits: mitigates the risk of home damage done by intruders.
- Amazon Echo: Smart speakers and smart plugs. Benefits: mitigates the risk of small household appliances staying on too long and causing home damage.
- General Electric: Smart appliances. Benefits: mitigates the risk of damage due to, for example, the stove/oven being left on all day (it can be shut off via phone).
6.2.1 Social Values
The code-chunk below extracts the high-potential basis variables into a smaller file.
Since these files are so large, they are purged from memory after the key variables are extracted.